
CAUTION: THESE RECORDS WILL BE USED
FOR OFFICIAL PURPOSES ONLY. DO NOT
REMOVE PAPERS NOR REVEAL CONTENTS
TO UNAUTHORIZED PERSON(S).



DATE OF REQUEST: Jan 61        SUSPENSE DATE: 10 Feb 61



FILE OR
SERIAL 
NUMBER 
AND 



(From File of Special Consultant (Friedman))
Statistics for Cryptology 



RETURN 

TO 



NAME AND EXTENSION OF PERSON REQUESTING FILE 

Mr. William Friedman LI 6-8520 



Mrs. Christian, AG-24, NSA, Ft. Geo. G. Meade, Md.










REF ID:A64688 



CONFIDENTIAL

Modified Handling Authorized



NATIONAL SECURITY AGENCY 
WASHINGTON 25, D. C. 



STATISTICS FOR CRYPTOLOGY 



NOTICE: This material contains information affecting the National Defense of the United States 
within the meaning of the espionage laws, Title 18, U.S.C., Sections 793 and 794, the transmission 
or the revelation of which in any manner to an unauthorized person is prohibited by law. 






This edition of "Statistics for Cryptology” is published for use in training programs. Com- 
ments and suggestions for the improvement of this text are invited, and should be forwarded to the 
Director, National Security Agency (Attn: TNG). 






PREFACE 

My indebtedness to Dr. Kullback is acknowledged for his careful reading of the manuscript 
and his many suggestions. Credit is also due to my colleagues in NSA, too numerous to name 
individually, who have discovered some of this theory and have called much of the rest to my 
attention. 



HOWARD H. CAMPAIGNE 






TABLE OF CONTENTS

Section                                                        Paragraphs

0. INTRODUCTION
     Information Theory                                           0, 1
     Statistics                                                   0, 2

1. REVIEW OF ELEMENTARY MATHEMATICAL STATISTICS
     The Binomial Distribution                                    1, 1
     The Multinomial Distribution                                 1, 2
     The Poisson Distribution                                     1, 3
     The Normal Distribution                                      1, 4
     The Chi-Squared Distribution                                 1, 5
     The Zipf Distribution                                        1, 6
     Approximate Distribution                                     1, 7
     Regression                                                   1, 8

2. THE MATCHING OF DISTRIBUTIONS
     Goodness of Fit Defined                                      2, 1
     Goodness of Fit of Two Samples                               2, 2
     Bayes' Theorem                                               2, 3
     Example of Application                                       2, 4
     Repeated Applications                                        2, 5
     Example of the Calculation of Bayes Factors                  2, 6
     Weights                                                      2, 7
     Rounded Weights and Risk-Admission Diagrams                  2, 8
     Two Category Weights                                         2, 9
     Three Category Weights                                       2, 10
     Statistics of Bayes Factors                                  2, 11

3. UNPREJUDICED ESTIMATES OF UNIVERSES
     The Law of Succession                                        3, 1
     Code Groups                                                  3, 2

4. SOME MATRIX DEFINITIONS AND PROPERTIES
     Elementary Properties of Matrices                            4, 1
     Determinants                                                 4, 2
     Inverses and Conjugate Transposes of Matrices                4, 3
     Vectors                                                      4, 4
     Geometry                                                     4, 5
     The Line of Regression                                       4, 6
     Examples of Cryptologic Applications                         4, 7

5. FLAGGING
     Rectangles                                                   5, 1
     Flags                                                        5, 2
     An Automatic Technique for Converging a Flag                 5, 3

6. FOURIER TRANSFORMS
     Definitions                                                  6, 1
     Properties of the Fourier Transforms                         6, 2
     Real Part                                                    6, 3
     Absolute Value                                               6, 4
     Application to Minuend Systems                               6, 5

7. THEORY OF CIRCULICES
     Enciphering Equations                                        7, 1
     Properties of Circulices                                     7, 2
     Fourier Transforms and Circulices                            7, 3
     Polynomials and Circulices                                   7, 4






STATISTICS FOR CRYPTOLOGY 

0. Introduction. 

This treatise is a concise exposition of the mathematical statistics which has applications in 
cryptology. It is aimed at a hypothetical cryptanalyst who has had college mathematics and 
forgotten most of it, or perhaps has had none but has done fairly extensive study on his own. 
The object here is to give definitions and develop properties in a simple and straightforward way. 
This treatise is not intended to be self-contained but is intended to be readable. Many concepts
will be given without following them very far. 

The theory held here is that the most formidable computation is relatively easy to under- 
stand if the reasons for it are clear. Therefore this exposition gives the basis for computational 
and statistical procedures, but does not dwell on their details, which are available elsewhere. 
The illustrative examples have been selected to have a minimum of computation, and are there- 
fore trivial. 

0, 1 Information Theory. 

0, 1, 1 Coding. 

Cryptology is concerned almost entirely with telecommunications, the transmission of in- 
formation over long distances by means of radio or wire lines. In each method of telecommu- 
nication the information is coded in some way, such as Morse for hand sending, or amplitude or 
frequency modulation for speech. Of course in enciphered communications further complica- 
tions are deliberately introduced, but the basic coding is unavoidable. It can usually be done 
in a variety of ways, depending upon whether brevity or reliability is more desirable. 

0, 1, 1, 1 Efficient Coding. 

The most common coding is binary, which we can represent on paper as 0's and 1's, that is,
some signal can be on or off, and combinations of these on-off signal elements are used to
represent elements of writing or of speech. One of the most used codes is the Baudot Code, in which
each combination has five elements, and there are $2^5 = 32$ distinct combinations. These 32
combinations are used to represent the 26 letters of the alphabet and a few functions, such as
carriage return, platen advance (line feed), word space, shift to upper case, etc.

By contrast the Morse code does not have a fixed number of elements for each letter, nor 
are the signal elements the same length. There are two conditions of the signal, on and off, and 
each of these is used in two versions, short (dot) and long (dash). Because of the latter, the 
two conditions of on and off must alternate. 

Communicators encounter the need for sending messages rapidly, and ask themselves the 
questions: What coding conveys the most information in a given time? Which coding sends
the information most reliably? Mr. Morse attempted to make his code efficient for English, for 
he assigned the short combinations to the frequent letters, dot for E and dash for T. He as- 
signed the longest combinations to the least frequent letters, such as dot dash dash dash for J. 
There are even longer combinations for the numerals, comma, and so forth. 

There is a principle of information theory which says that the more effectively a
communication system is used the more it sounds like noise. Thus the Morse code will transmit a
maximum of information when there are lots of E's and T's and only a few J's and Q's. The Baudot
code will be at its best when the letters (including carriage returns, line feeds, and so forth) are
equally frequent. If the frequencies deviate from this then a code can be tailored to be more
effective.

Making an effective code depends on knowing in advance the frequencies with which the 
letters (or words and phrases in the case of book codes, to which all of this applies) will occur. 
The effectiveness of the code will be no greater than the accuracy of the prediction of the statis- 
tics. 



Given a frequency count there are some simple rules enabling us to construct an efficient 
code. Below is a frequency count and a binary code made to transmit rapidly sets of these 
symbols in these proportions. 





        Frequency       Code        Weighted Length

A         .30           10              .60
B         .25           00              .50
C         .15           010             .45
D         .10           011             .30
E         .07           1100            .28
F         .05           1110            .20
G         .03           1101            .12
H         .02           11110           .10
I         .02           111110          .12
J         .01           111111          .06



The average length of combination is 2.73 binary digits; if a fixed length code were used it 
would take at least four signal elements per letter. 
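The arithmetic behind the average length can be checked with a short sketch. The frequencies and code groups are copied from the table; the names `TABLE`, `average_length`, and `fixed_length_needed` are illustrative choices and not part of the original text.

```python
# Frequencies and code groups copied from the table in the text.
TABLE = {
    "A": (0.30, "10"),
    "B": (0.25, "00"),
    "C": (0.15, "010"),
    "D": (0.10, "011"),
    "E": (0.07, "1100"),
    "F": (0.05, "1110"),
    "G": (0.03, "1101"),
    "H": (0.02, "11110"),
    "I": (0.02, "111110"),
    "J": (0.01, "111111"),
}


def average_length(table):
    """Sum of frequency times code length over all symbols."""
    return sum(freq * len(code) for freq, code in table.values())


def fixed_length_needed(symbol_count):
    """Smallest number of binary elements giving at least symbol_count combinations."""
    bits = 0
    while 2 ** bits < symbol_count:
        bits += 1
    return bits


if __name__ == "__main__":
    print(round(average_length(TABLE), 2))    # weighted average code length
    print(fixed_length_needed(len(TABLE)))    # elements per letter for a fixed-length code
```

The weighted lengths in the last column of the table are exactly the terms of this sum.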

This is a Fano code, as originally proposed by Prof. R. M. Fano of the Massachusetts
Institute of Technology. One way of constructing the code to fit the frequencies is to assign
the first signal element for each letter, then the second for each, and so forth. Each signal
element is a yes-no answer to a question, and will be efficient if the answers are equally likely.
Therefore, at each step we want half of the code groups to be 0 and the other half to be 1. So we
select at the first step a subset of the letters whose total frequency is as close to one-half as we
can make it, assign 0 as the first signal element to these and 1 to the remaining. Then we split
the subset into half and assign 0 to one-half and 1 to the other. We continue this until a letter
is unique in its subset; after that there is no further need for signal elements, and the code for
that letter is terminated. In this way frequent letters get short codes. A letter with n signal
elements should have frequency about $2^{-n}$.
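The splitting procedure can be sketched as follows. This is one possible reading of the construction, with two assumptions: the letters are first sorted by frequency and every subset is a contiguous run of that sorted list. The text only asks for a subset whose total is near one-half, so other choices are equally valid and give different, equally good code groups. The function name `fano_code` is hypothetical.

```python
FREQS = {"A": 0.30, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.07,
         "F": 0.05, "G": 0.03, "H": 0.02, "I": 0.02, "J": 0.01}


def fano_code(freqs):
    """Return {symbol: code group} built by repeated near-equal splits."""
    # Assumption: sort by descending frequency and split contiguous runs only.
    items = sorted(freqs.items(), key=lambda kv: -kv[1])
    codes = {sym: "" for sym, _ in items}

    def split(group):
        if len(group) < 2:
            return  # a letter alone in its subset needs no further elements
        total = sum(f for _, f in group)
        running, best_cut, best_gap = 0.0, 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            gap = abs(total - 2 * running)  # distance of this cut from an even split
            if gap < best_gap:
                best_gap, best_cut = gap, i
        for sym, _ in group[:best_cut]:
            codes[sym] += "0"
        for sym, _ in group[best_cut:]:
            codes[sym] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(items)
    return codes


def is_prefix_free(codes):
    """True if no code group is the beginning of another."""
    groups = list(codes.values())
    return not any(a != b and b.startswith(a) for a in groups for b in groups)


if __name__ == "__main__":
    codes = fano_code(FREQS)
    for sym in sorted(codes):
        print(sym, codes[sym])
    print(sum(FREQS[s] * len(codes[s]) for s in FREQS))
```

The code groups produced need not coincide with those of the table, since the assignment of 0 and 1 at each split is arbitrary, but the lengths follow the same pattern of short groups for frequent letters.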

The coding could be made for groups of letters, such as digraphs or words, instead of single 
letters. More efficiency can be achieved with larger units but at the expense of man-hours or 
equipment and delay in sending the signal. The English language is a code of this sort; frequent 
words tend to get shorter, as "automobile” becomes “auto”, “television” becomes “TV”, and 
so forth. 

The Fano coded text is sent as a continuous stream of binary digits. It can be resolved
unambiguously into its original meaning. In our example if a letter begins with 00 it is a B.
If it begins with 01 then exactly one more signal element must be examined to determine whether
the letter is a C or a D. If it begins with 10 it is an A. If it begins with 11 at least two
more signal elements must be examined. Thus at each signal element we know whether a letter
has been determined or not, and if so then a new letter begins with the next element.
Beginning from the start of a letter it can always be recognized correctly from ungarbled text.

Now if a signal element is incorrectly received at least one letter will be wrong. But worse,
the beginning of the next letter may be obscured. But even so, after a few errors we will get
back in phase, and once in will stay in so long as the text received is correct. The reader can
check this in the following example:

A J B A B E J F E D

10, 111111, 00, 10, 00, 1100, 111111, 1110, 1100, 011

(The commas are introduced to simplify our examination of the stream; no spaces or other in- 
terruptions to the binary stream are transmitted.) If we suppose that exactly one of the first 
six binary digits is complemented we find that not more than three letters are garbled. The 
number of letters received is not necessarily the same as that sent. The number of plain letters 
garbled when a signal element is changed is not known in general, even in the statistical sense 
of knowing the expected number. 
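The reader's check can also be carried out mechanically. The sketch below encodes the example with the code groups of the table, complements one binary digit, and decodes the result element by element. The helper names `encode`, `decode`, and `flip` are illustrative only.

```python
CODE = {"A": "10", "B": "00", "C": "010", "D": "011", "E": "1100",
        "F": "1110", "G": "1101", "H": "11110", "I": "111110", "J": "111111"}
DECODE = {bits: letter for letter, bits in CODE.items()}


def encode(text):
    """Concatenate the code groups with no separators."""
    return "".join(CODE[letter] for letter in text)


def decode(stream):
    """Read one signal element at a time; emit a letter as soon as a group is complete."""
    letters, buffer = [], ""
    for bit in stream:
        buffer += bit
        if buffer in DECODE:
            letters.append(DECODE[buffer])
            buffer = ""
    return "".join(letters)  # an incomplete group at the very end is dropped


def flip(stream, index):
    """Complement the binary digit at a 0-based index."""
    flipped = "1" if stream[index] == "0" else "0"
    return stream[:index] + flipped + stream[index + 1:]


if __name__ == "__main__":
    plain = "AJBABEJFED"
    stream = encode(plain)
    print(decode(stream))
    for i in range(6):  # complement each of the first six digits in turn
        print(i + 1, decode(flip(stream, i)))
```

Complementing the second digit, for instance, returns nine letters instead of ten, which illustrates the remark that the number of letters received need not equal the number sent.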

The same principles can be used in making a code even though the number of distinct signal 
elements is 3, 10, or 26, and the things represented by the code are digraphs, trigraphs, or words. 

The efficiency of a Fano code is achieved by lowering the redundancy. This is shown in 
the way errors are multiplied. Later in this chapter we will discuss a way of measuring redun- 
dancy. 

0, 1, 1, 2 Redundant Coding. 

Instead of lowering the redundancy for efficiency we might raise it to detect errors or even
correct them. A simple way to do this is the following. To the five element Baudot code
adjoin a sixth element which will be 0 or 1 so that the number of 1's is even for each code
combination. Now if a single element is changed in transmission the number of 1's will necessarily
be odd and the occurrence of a garble will be recognizable. However, the correct value could
not be reconstructed.
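A minimal sketch of the sixth-element check follows. The five-element combination used is arbitrary and is not claimed to be any particular Baudot character; `add_parity` and `parity_ok` are hypothetical names.

```python
def add_parity(bits):
    """Adjoin one element so that the total number of 1's is even."""
    return bits + str(bits.count("1") % 2)


def parity_ok(word):
    """True if the number of 1's is even, that is, no single garble is evident."""
    return word.count("1") % 2 == 0


if __name__ == "__main__":
    sent = add_parity("10110")       # arbitrary five-element combination
    print(sent, parity_ok(sent))
    garbled = sent[:2] + ("0" if sent[2] == "1" else "1") + sent[3:]
    print(garbled, parity_ok(garbled))
```

The check reports that something is wrong but gives no indication of which element was changed, as the text notes.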

An error correcting code can be illustrated thus. Suppose we have a 4 binary digit code 
for the decimal digits, together with space, upper and lower shift, comma, period, and platen 
advance, 16 characters in all. Now add some more signal elements to each code combination. 
It will be convenient to make the new ones to be the first, second, and fourth in the group; the 
third, fifth, sixth, and seventh are assumed already there. The fourth is selected so that the 
sum of the last four signal elements is 0 mod 2. The second should be such that the sum of 
the 2nd, 3rd, 6th, and 7th is 0 mod 2. The first should be such that the sum of the odd num- 
bered signal elements is 0 mod 2. 

Position                 1 2 3 4 5 6 7

Sum of last four               x x x x
Sum of 2nd, 3rd, 6th, 7th  x x     x x
Sum of odd positions     x   x   x   x

Now suppose that a single signal element has been reversed. At least one of these sums 
will no longer be 0. We look at each of the conditions in the order given. If the condition is 
satisfied put down a 0; if not put down a 1. The three binary digit number generated thus will 
be the position within the group of the incorrect element. For example, if the code group 1000001 
arrived we look at the last four digits and add them mod 2. We get a 1, which shows that there 
is an error. We add the 2nd, 3rd, 6th, and 7th getting 1. We add the odd digits, getting 0. 
We now have generated the binary number 110 which is the number 6; the sixth digit is
incorrect. The correct group was therefore 1000011. This kind of correctable code was first
described by R. W. Hamming in the Bell System Technical Journal, April, 1950 (33)*.



* Numbers in parentheses refer to the Bibliography.

The same pattern could be used to expand a four decimal digit code to make it single error
correcting, but this time the sums would be 0 mod 10. As an example take the garbled group
8753708. Using the patterns given before one can find the digit to be corrected, noting only
whether each sum is 0 or not 0 mod 10. The amount of correction is easily found. In fact this
correction is in this case over-determined, which shows that errors in two digits could be detected
in some cases.
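Both worked examples, the binary group 1000001 and the decimal group 8753708, can be traced with one sketch in which the modulus is a parameter. The three sums are taken in the order given in the text. The function names and the choice to return the position alongside the corrected group are assumptions made for illustration.

```python
# Positions (counted from 1) entering each sum, in the order given in the text.
CHECKS = [(4, 5, 6, 7), (2, 3, 6, 7), (1, 3, 5, 7)]


def check_sums(group, modulus):
    """Return the three sums, each reduced by the modulus."""
    digits = [int(ch) for ch in group]
    return [sum(digits[p - 1] for p in positions) % modulus for positions in CHECKS]


def encode(data, modulus=2):
    """Place four data digits in positions 3, 5, 6, 7 and fill in positions 4, 2, 1."""
    d3, d5, d6, d7 = (int(ch) for ch in data)
    p4 = -(d5 + d6 + d7) % modulus
    p2 = -(d3 + d6 + d7) % modulus
    p1 = -(d3 + d5 + d7) % modulus
    return "".join(str(x) for x in (p1, p2, d3, p4, d5, d6, d7))


def correct_single_error(group, modulus=2):
    """Return (corrected group, position of the error); position 0 means no error seen."""
    sums = check_sums(group, modulus)
    position = 0
    for s in sums:  # a failed sum contributes a 1, a satisfied sum a 0
        position = 2 * position + (1 if s else 0)
    if position == 0:
        return group, 0
    amount = next(s for s in sums if s)  # every failed sum is off by the same amount
    digits = [int(ch) for ch in group]
    digits[position - 1] = (digits[position - 1] - amount) % modulus
    return "".join(str(d) for d in digits), position


if __name__ == "__main__":
    print(correct_single_error("1000001", 2))
    print(correct_single_error("8753708", 10))
```

In the decimal case two of the sums fail, and both are off by the same amount. This is the over-determination mentioned in the text.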

It is possible to make a double-error detecting or a double-error correcting code, or to make 
a code as reliable as one wishes by introducing enough redundancy. There is a theorem which 
says that by tolerating enough delay information can be sent as accurately as one chooses without 
decreasing the rate of sending. The delay arises because the information must be coded in large 
pieces. 



0, 1, 2 Measuring Information. 

We have seen that the efficiency can be raised by lowering the redundancy, or the reliability 
can be raised by increasing the redundancy. Can we measure the amount of redundancy con- 
tained in a message? We can, and in order to introduce the measure heuristically, we will give a 
preliminary discussion. 

First, consider a yes-no question. How much information is conveyed by the answer? It
clearly depends on the a priori probabilities of the answers. Compare two questions, paraphrased
from a civil service form, one of which is "Are you a male?", and the other "Do you use
intoxicants to excess?" Answers to the first will be half "yes" and half "no". Answers to the second
will be almost all "no". The first gives more information, of course; the second gives almost
none. If the a priori probabilities are p and q, then a measure of the information will be a
function H(p,q) of these probabilities. Furthermore $H(\frac{1}{2},\frac{1}{2})$ has a special value,
probably a maximum, and H(q,p) = H(p,q).

If the question has c possible values, such as a letter to be filled in on a form ("middle initial",
c = 26), then $H(p_1, p_2, \ldots, p_c)$ is a function of c variables. In general we would expect
more information to be conveyed by more values, and in particular

$$H\left(\tfrac{1}{c}, \tfrac{1}{c}, \ldots, \tfrac{1}{c}\right) \ge H\left(\tfrac{1}{b}, \tfrac{1}{b}, \ldots, \tfrac{1}{b}\right)$$

for $c \ge b$.

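We would like H to have the following properties (34):

(1) $H(\frac{1}{c}, \frac{1}{c}, \ldots, \frac{1}{c}) \ge H(\frac{1}{b}, \frac{1}{b}, \ldots, \frac{1}{b})$ for c > b.

(2) $H(p_1, p_2, \ldots, p_c)$ should be a continuous function of each $p_i$.

(3) If a decision can be made in two steps, then the information H should be the weighted
sum of the information at each step. An example will make this clear. We want

$$H(px, py, qr, qs) = H(p, q) + pH(x, y) + qH(r, s),$$

where p + q = x + y = 1 = r + s.

The properties defined above are sufficient to uniquely define the function $H(p_1, p_2, \ldots, p_c)$,
as follows:

Condition (3) implies that $H(\frac{1}{s^m}, \ldots, \frac{1}{s^m}) = mH(\frac{1}{s}, \ldots, \frac{1}{s})$. For instance,

$$H\left(\tfrac{1}{8}, \ldots, \tfrac{1}{8}\right) = H\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{1}{2}H\left(\tfrac{1}{4}, \ldots, \tfrac{1}{4}\right) + \tfrac{1}{2}H\left(\tfrac{1}{4}, \ldots, \tfrac{1}{4}\right) = \ldots = 3H\left(\tfrac{1}{2}, \tfrac{1}{2}\right).$$

Proceeding, we set the integer $m = \log_2 t$, or $t = 2^m$. Then
$H(\frac{1}{t}, \ldots, \frac{1}{t}) = mH(\frac{1}{2}, \frac{1}{2}) = \log_2 t \; H(\frac{1}{2}, \frac{1}{2})$. Let $H(\frac{1}{2}, \frac{1}{2}) = K$.

Now consider a case where $p_i = r_i/S$, $S = \sum_{i=1}^{c} r_i$; that is, the probabilities are all rational.
Consider this as choosing among S answers in two steps, using principle (3),

$$H\left(\tfrac{1}{S}, \ldots, \tfrac{1}{S}\right) = H(p_1, \ldots, p_c) + p_1 H\left(\tfrac{1}{r_1}, \ldots, \tfrac{1}{r_1}\right) + p_2 H\left(\tfrac{1}{r_2}, \ldots, \tfrac{1}{r_2}\right) + \ldots + p_c H\left(\tfrac{1}{r_c}, \ldots, \tfrac{1}{r_c}\right),$$

whence

$$H(p_1, \ldots, p_c) = H\left(\tfrac{1}{S}, \ldots, \tfrac{1}{S}\right) - p_1 H\left(\tfrac{1}{r_1}, \ldots, \tfrac{1}{r_1}\right) - p_2 H\left(\tfrac{1}{r_2}, \ldots, \tfrac{1}{r_2}\right) - \ldots - p_c H\left(\tfrac{1}{r_c}, \ldots, \tfrac{1}{r_c}\right)$$

$$= K \log_2 S - Kp_1 \log_2 r_1 - Kp_2 \log_2 r_2 - \ldots - Kp_c \log_2 r_c$$

$$= \left(-p_1 \log_2 \tfrac{r_1}{S} - p_2 \log_2 \tfrac{r_2}{S} - \ldots - p_c \log_2 \tfrac{r_c}{S}\right)K.$$

If one of the $r_j = 0 = p_j$ we agree that $p_j H(\frac{1}{r_j}, \ldots, \frac{1}{r_j}) = 0$. Therefore

$$H(p_1, p_2, \ldots, p_c) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \ldots + p_c \log_2 p_c)K.$$

If the $p_i$ are irrational they can be approximated by rationals, and the assumption of
continuity implies that $H(p_1, \ldots, p_c) = -K \sum_{i=1}^{c} p_i \log_2 p_i$ still. The constant K is positive but
arbitrary, a choice of unit. Let us agree to take $K = H(\frac{1}{2}, \frac{1}{2}) = 1$, and call this unit a "bit" of
information.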



We now have a measure of the amount of information which can be transmitted under given 
conditions. Other measures might also exist, but this is the only one (aside from the choice of 
unit, the bit) which satisfies the three conditions given above. The reader is warned that this is 
a measure invented by a communicator, and in a sense measures the work of transmitting the data, 
or the capacity of a channel to carry information. It applies to a process or a channel, not to 
semantics. The function H gives a lower bound in binary signal elements on the abbreviation
which can be achieved by using a Fano or similar code. Purely flat random material will give a
maximum value for H, even though it may have no semantic content.

The measure $H(p_1, p_2, \ldots, p_c) = -\sum_{i=1}^{c} p_i \log_2 p_i$ has been called the "entropy", since it
resembles the function introduced by Boltzmann to measure the degree of disorganization in a
physical system. It has many interesting properties. It is a symmetric function of its arguments
$p_1, p_2, \ldots$. Its largest value occurs when the probabilities are all equal,
$p_1 = p_2 = \ldots = p_c = \frac{1}{c}$, and then it is $H = \log_2 c$. It has the value 0 only if $p_1 = 1$ and
every other $p_i = 0$, that is, the answer to the question is certain. Probabilities which are 0 can
be disregarded, $H(p_1, p_2, \ldots, p_{c-1}, 0) = H(p_1, p_2, \ldots, p_{c-1})$.
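The properties just listed, together with the two-step condition (3), can be verified numerically for particular values. The sketch below takes K = 1, so that H is in bits; the probabilities chosen for the two-step check are arbitrary illustrative values.

```python
import math


def entropy(probs):
    """H = -sum p log2 p in bits; zero probabilities are disregarded."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    print(entropy([0.5, 0.5]))          # the unit, one bit
    print(entropy([1 / 8] * 8))         # equal probabilities give log2 c
    print(entropy([1.0, 0.0]))          # a certain answer conveys nothing
    # Two-step decision of condition (3), with arbitrary illustrative values.
    p, q, x, y, r, s = 0.3, 0.7, 0.2, 0.8, 0.6, 0.4
    one_step = entropy([p * x, p * y, q * r, q * s])
    two_step = entropy([p, q]) + p * entropy([x, y]) + q * entropy([r, s])
    print(one_step, two_step)
```

The last line prints the same number twice, up to rounding in the final digits, which is the content of condition (3).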




Graph of H(p,q), p+q = 1.

The entropy is a measure of the disorganization of a physical system, and H is a measure of
the lack of pattern in a message. If we let

$$R = \frac{\log_2 c - H}{\log_2 c}$$

we have a measure of the redundancy. If R = 0 no abbreviation can be achieved by coding. If
R > 0 then some abbreviation is possible without losing information. If R = 1 then H = 0 and no
information is being transferred.

An example with R = 1 is the message EEEEEEE. . . E.
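As a sketch of how R behaves at its extremes, the following computes the redundancy of a flat 26-letter source, of the all-E message, and of the ten-letter frequency count used for the Fano code earlier. It assumes R = (log c - H)/log c with c the number of categories, and the function names are illustrative.

```python
import math


def entropy(probs):
    """H = -sum p log2 p in bits; zero probabilities are disregarded."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def redundancy(probs):
    """R = (log2 c - H) / log2 c, where c counts every category, used or not."""
    c = len(probs)
    return (math.log2(c) - entropy(probs)) / math.log2(c)


if __name__ == "__main__":
    flat = [1 / 26] * 26
    all_e = [1.0] + [0.0] * 25            # the message EEEE...E
    fano_counts = [0.30, 0.25, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02, 0.02, 0.01]
    print(redundancy(flat))
    print(redundancy(all_e))
    print(redundancy(fano_counts))
```

For the ten-letter count the entropy comes out a little below the 2.73 binary digits per letter achieved by the Fano code, in keeping with the statement that H is a lower bound.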

The source of text can be examined digraphically by the same function R. In this case c is
replaced by $c^2$, so that

$$R = \frac{2 \log_2 c - H(p_{11}, p_{12}, \ldots, p_{cc})}{2 \log_2 c}.$$

It can happen that digraphic or polygraphic examination will reveal redundancy not shown by
the previous measures. For example a five digit code with a garble check will have no more
information pentagraphically than it has tetragraphically.

A way of estimating the entropy of a source from a long message is the following: Consider
a message of length N. It contains $f_1$ of the first letter, $f_2$ of the second, and so on. The
expected value of $f_i$ is $p_i N$. The probability q of exactly this message is
$q = p_1^{f_1} p_2^{f_2} \cdots p_c^{f_c}$, or $\log q = \sum_{i=1}^{c} f_i \log p_i$. The expected value of log q is

$$E(\log q) = \sum_{i=1}^{c} p_i N \log p_i = -NH(p_1, \ldots, p_c), \quad \text{or} \quad H = \frac{E(\log \frac{1}{q})}{N}.$$

See (34) theorem 3.

Shannon states that the redundancy for English is approximately $R = \frac{1}{2}$. One way to
estimate this is to see what proportion of letters can be eliminated (at random) from English text
without concealing the meaning from a discerning reader. Following is an example with more
than half the letters (and the word spaces) deleted in a pattern taken from a random number
table.



Y.U.A...OL.-.AT..R-.IL.IA. 

. . E - . . UN . - MAN .... D - AN . - YOUR 
. A . . - HAS ...C...-V.R.. WH . . . 

A . D - YE . - Y . U . INC . . S . . T . Y . . . A . . 
....O.R...A.-D... OU . T . INK - 
... Y . U . — .......I. — . I G . , 

The symbol "-" is a word spacer. Some word spacers are present, some are absent. About
half the letters are shown; find the others.

Here is another example. Again about half the letters (48%) have been suppressed, but this 
time it is the less frequent letters which have been suppressed. 

.HE HIN...S..I.SIS E...E.H. 

N I O.ESSIN.HI.HI..OSSI.I.I.IES 

,.ENO..ONSI.E.E. p HO.E.E.H..IN.. 

O . N . 

No word spacers are present. The little p is a period. 

E O N I S H These 6 letters have 52 occurrences. 

TARDLUC MFGBVY WPJKQXZ 48 occurrences suppressed.

In this third example the more frequent letters are suppressed, 54% of the text.

MOD.RN..G.BR.H...XPO..DFOR.H 
.F.R....M..H.FU..V.R...Y.NDR 
.CHN...OFPO... B..M..H.M...C.. 

.Y..M. P W..H... 

E T A I S L Of these 6 letters 54 occurrences are suppressed. 

RNMOFGBH UVYCWDP These 15 letters appear 46 times. 

J K Q Z do not appear.



From this outline of information theory we can see two things. First, the theory is inherently
statistical. The statements of information theory all deal with large numbers of elements, never
with single elements. In fact most questions of statistical theory are motivated by attempts to
derive information from incomplete or diluted data. Second, cryptology is very much
concerned with the same questions as is information theory.

0, 2 Statistics.

Almost all statistical theory is applicable in some way to cryptologic problems. That theory 
most readily applied is distinctly mathematical. We refer the reader to: 

(1) “Introduction to Mathematical Probability” by Uspensky, McGraw-Hill. 

(2) "Mathematical Methods of Statistics” by Harald Cramer, Princeton University Press. 

(3) "Mathematical Statistics” by S. S. Wilks of Princeton. 

(4) “An Introduction to Probability Theory and its Applications” by W. Feller. 

One of the frequent cryptologic problems is testing hypotheses. That is, we have some data, 
a cryptogram say, which may have come from one of a number of causes. Which one is most 
likely? The available evidence in cryptanalysis can normally be digested by using Bayes’ theo- 
rem. Usually a multiple of the logarithm of the Bayes factor is computed, that for each unit 
being called its “weight” in some appropriate unit. This is described in Chapter 2. 

The cryptanalyst is often trying to draw inferences about a "universe” of cryptograms from 
a sample. This sample is not under his control; he has to take what is intercepted. The small- 
ness of the samples may prejudice his inferences. The first case discussed is, given the frequency 
count of a sample, to estimate the distribution of the universe from which it came. A first
estimate would be the proportion $f_i/N$ for a letter which occurred $f_i$ times in a sample of N. Under
certain circumstances better estimates can be made. This is taken up in Chapter 3. 

Because rectangular arrays are so much used in cryptanalytic statistics, a preliminary chap- 
ter is devoted to matrices. This is followed by an exposition of flagging (a technique of sorting 
distributions into two or more sets) with emphasis on its matrix character. 

A chapter on Fourier methods is followed by one on circulices, which shows that the Fourier 
transform is included in matrix theory. 

All the matrix and Fourier techniques are aimed at recovering periodic tendencies in a stream 
of cipher text or key.

1. Review of Elementary Mathematical Statistics.

Some of the results of Kullback (6) and other sources are summarized here for ready reference. 

One of the first questions arising when looking at a sample of “random” objects, such as a 
stream of letters, is, "Is it really ‘random’?” To answer this we need to know what randomness 
is. It is frequently taken to be the equivalent of “flatness.” This latter term can be visualized 
in the following way. Suppose that a histogram is drawn from a frequency count of the sample. 
It might look as follows: 




The sample pictured here is "flat,” for the deviations which occur from the average are what 
might occur from chance. 

If the sample were from a source which is not pure chance, but has a pattern, such as letters 
from newspaper text or digits from a telephone directory, then the histogram might appear thus: 







This sample is “rough,” rather than flat. 

There are several statistics invented to measure roughness. One of these is $\phi$ (pronounced
phi). If the observed frequencies are $f_1, f_2, f_3$, and so forth to $f_c$, where c is the number of
categories, then

$$\phi = \sum_{i=1}^{c} f_i(f_i - 1). \tag{1, 1}$$

For a given size $N = \sum_{i=1}^{c} f_i$ of sample, $\phi$ is larger for rougher and smaller for flatter samples.
The smoothest possible sample is that in which all frequencies are the same, $f_i = \frac{N}{c} = f_j$. Then

$$\phi = \sum_{i=1}^{c} \frac{N}{c}\left(\frac{N}{c} - 1\right) = N\left(\frac{N}{c} - 1\right).$$

The roughest possible frequency count is that in which one count has everything, $f_1 = N$, and
all others have nothing, $f_i = 0$. Then $\phi = N(N - 1)$. The values of $\phi$ for intermediate
roughness are between these extremes. Another observation will illustrate how $\phi$ varies.
Suppose we take a little, say 1, off a small count, say $f_1$, and add it to a larger count, say $f_2$.
We now have a new count $f'_i$, where $f'_1 = f_1 - 1$, $f'_2 = f_2 + 1$, and $f'_i = f_i$ for i > 2.
What has happened to $\phi$? It has become:

$$\phi' = \sum_{i=1}^{c} f'_i(f'_i - 1)$$

$$= f'_1(f'_1 - 1) + f'_2(f'_2 - 1) + \sum_{i=3}^{c} f'_i(f'_i - 1)$$

$$= (f_1 - 1)(f_1 - 2) + (f_2 + 1)f_2 + \sum_{i=3}^{c} f_i(f_i - 1)$$

$$= (f_1 - 1)f_1 + f_2^2 - f_2 + \sum_{i=3}^{c} f_i(f_i - 1) - 2(f_1 - 1) + 2f_2$$

$$= \phi + 2(f_2 - f_1) + 2.$$

Since $f_2$ is larger than $f_1$ we see that $\phi'$ is larger than $\phi$ by at least 2. The greater the a priori
discrepancy $f_2 - f_1$ between the two counts the more this operation increases $\phi$.
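The two extremes and the effect of moving one tally from a smaller count to a larger one can be checked on small counts. The sketch assumes phi is the sum of f(f - 1) over the categories, as in (1, 1); the particular counts are arbitrary.

```python
def phi(counts):
    """Sum of f (f - 1) over all categories."""
    return sum(f * (f - 1) for f in counts)


if __name__ == "__main__":
    n, c = 20, 4
    print(phi([n // c] * c), n * (n // c - 1))   # smoothest count against N(N/c - 1)
    print(phi([n, 0, 0, 0]), n * (n - 1))        # roughest count against N(N - 1)

    before = [3, 5, 2, 7]     # arbitrary counts with f1 = 3 smaller than f2 = 5
    after = [2, 6, 2, 7]      # one tally moved from the first count to the second
    print(phi(after) - phi(before), 2 * (before[1] - before[0]) + 2)
```

Each printed pair agrees, the last being the increase 2(f2 - f1) + 2 derived above.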

The drawing of a sample is often pictured in the following way. Suppose we had a large 
barrel filled with tiny scraps of paper, on each of which is a mark, such as a letter. The number 
of scraps is very large, and the number of distinct marks is c. The barrel is referred to as the 
“universe,” and the contents of the barrel as the “population.” We stir up the population vig- 
orously, and then reach in and withdraw a handful of scraps of paper; this is the “sample.” The 
number of scraps of paper with mark i on them is fi. 

The notation E(x) is read “the expected value of x” and is defined to be the weighted mean 
of all the possible values of x. 

Suppose the marks in the population are in proportion $p_1 : p_2 : p_3 : \ldots$.
Then the proportion $f_i/N$ of marks i in the sample is expected to be $p_i$; $E(f_i/N) = p_i$, or
$E(f_i) = p_i N$. From this we can find that

$$E(\phi) = N(N - 1) \sum_{i=1}^{c} p_i^2, \tag{1, 2}$$

see (6), paragraph 18.

A frequently used function of $\phi$ is the Index of Coincidence (abbreviated I. C.),

$$\delta = \frac{c\,\phi}{N(N - 1)}. \tag{1, 3}$$

The expected value of $\delta$ is

$$E(\delta) = \frac{c}{N(N - 1)} E(\phi) = c \sum_{i=1}^{c} p_i^2. \tag{1, 4}$$

For a flat universe, that is, one for which $p_i = 1/c$ for each i,

$$E(\delta) = c \sum_{i=1}^{c} 1/c^2 = 1,$$

which is convenient to remember and use. An exposition of the I. C. can be found in the
pamphlet The Index of Coincidence (18).
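A minimal sketch of the I. C. as defined in (1, 3), and of its expected value (1, 4), follows. Writing the I. C. as c times phi over N(N - 1) is the reading adopted here for the garbled formula; the function names and the tiny two-letter sample are illustrative only.

```python
from collections import Counter


def index_of_coincidence(text, c=26):
    """c * phi / (N (N - 1)) for the letters of text, with c categories assumed."""
    counts = Counter(text)
    n = len(text)
    phi = sum(f * (f - 1) for f in counts.values())
    return c * phi / (n * (n - 1))


def expected_ic(probs):
    """c times the sum of squared probabilities, c being the number of categories."""
    return len(probs) * sum(p * p for p in probs)


if __name__ == "__main__":
    print(index_of_coincidence("AABB", c=2))
    print(expected_ic([1 / 26] * 26))        # flat universe
    print(expected_ic([0.7, 0.1, 0.1, 0.1])) # a rough universe gives a value above 1
```

Because the flat case always gives an expected value of 1, an observed I. C. well above 1 points toward a rough universe.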

The expression,

$$\gamma = \frac{c}{N^2} \sum_{i=1}^{c} f_i^2 \tag{1, 5}$$

approaches the I. C., $\delta$, as N grows large, and is sometimes used instead. The factor
$\sum_{i=1}^{c} f_i^2 = \psi$ is sometimes used also. The statistic $\psi$ is related to $\phi$ by

$$\psi = \phi + N. \tag{1, 6}$$

The variances of $\phi$ and $\psi$ are derived in Kullback (6), Appendix D. They are

$$\sigma^2(\phi) = \sigma^2(\psi) = 2N(N - 1)\left[2(N - 2) \sum_{i=1}^{c} p_i^3 + \sum_{i=1}^{c} p_i^2 - (2N - 3)\left(\sum_{i=1}^{c} p_i^2\right)^2\right]. \tag{1, 7}$$

For the special case $p_i = 1/c$, which is used continually for comparison purposes, this reduces to

$$\sigma^2(\phi) = 2N(N - 1)\frac{c - 1}{c^2}. \tag{1, 8}$$

Therefore for the I. C. the variance for a flat universe is

$$\sigma^2(\delta) = \frac{c^2}{N^2(N - 1)^2} \cdot 2N(N - 1)\frac{c - 1}{c^2} = \frac{2(c - 1)}{N(N - 1)} \approx \frac{2(c - 1)}{N^2}. \tag{1, 9}$$



This follows from the theorem that for a variable x and constants a and b, 


$$\sigma^2(ax + b) = a^2 \sigma^2(x). \tag{1, 10}$$
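The reduction of (1, 7) to (1, 8) for a flat universe, and the step to (1, 9) by way of (1, 10), can be confirmed numerically. The formulas coded below are the forms of (1, 7) through (1, 9) as read here from a damaged original, so they should be taken as an assumption; N = 100 and c = 26 are arbitrary illustrative values.

```python
def var_phi(n, probs):
    """Variance of phi in the form read for (1, 7)."""
    s2 = sum(p ** 2 for p in probs)
    s3 = sum(p ** 3 for p in probs)
    return 2 * n * (n - 1) * (2 * (n - 2) * s3 + s2 - (2 * n - 3) * s2 ** 2)


def var_phi_flat(n, c):
    """Special case p_i = 1/c, in the form read for (1, 8)."""
    return 2 * n * (n - 1) * (c - 1) / c ** 2


def var_ic_flat(n, c):
    """Variance of the I. C. for a flat universe: (1, 10) applied with a = c / (N (N - 1))."""
    a = c / (n * (n - 1))
    return a ** 2 * var_phi_flat(n, c)


if __name__ == "__main__":
    n, c = 100, 26
    print(var_phi(n, [1 / c] * c), var_phi_flat(n, c))
    print(var_ic_flat(n, c), 2 * (c - 1) / (n * (n - 1)), 2 * (c - 1) / n ** 2)
```

The last line shows how close the approximation with N squared in the denominator is to the exact value at this sample size.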




If we have a quantity which takes on various values, and if these values are determined by chance,
such that we can express the probabilities that a < x < b, then we call the variable a
"statistical" or "stochastic" variable. For example, if we cut some newspapers into little pieces, one
letter on each piece, put these in a barrel, and then draw them out one by one, the letter drawn
is a stochastic event. The probability for the occurrence of E is greater than that for Q, and for
a large enough sample the letters will appear in proportions which can be predicted approximately.
A description of the probabilities of the values is called the "distribution" of the variable.

Two events are said to be “independent” if neither affects the outcome of the other. For 
instance with our barrel of newspaper text, if after examining a sample we replace it in the barrel 
and stir before drawing another then we can be confident that the results of the two samples are 
independent. On the contrary if we keep out one piece with one letter on it then subsequent 
draws will be modified, potentially at least. The determination of whether events are independ- 
ent or dependent, completely or partially, is sometimes of great importance in ascertaining the 
significance of those events. 

1, 1 The Binomial Distribution. 

If we have a stochastic event which can take on two values (such as "hit” or no "hit” between 
letters of two texts which have been lined up) with probabilities p and q, then if x is 1 or 0 accord- 
ing as one or the other event occurred, x is "binomially” distributed. Then 



$$E(x) = p \cdot 1 + q \cdot 0 = p,$$

and

$$\sigma^2(x) = E(x^2) - E^2(x) = p \cdot 1^2 + q \cdot 0^2 - p^2 = p(1 - p) = pq.$$

The sum $y = \sum_{1}^{n} x$ is the number of times the value 1 came up in n trials.

$$E(y) = pn \tag{1, 1, 1}$$

and

$$\sigma^2(y) = npq \tag{1, 1, 2}$$

(if the trials do not affect each other).

The probability of a specific count y = k is

$$P(y = k) = \frac{n!}{k!\,(n - k)!}\, p^k q^{n-k}. \tag{1, 1, 3}$$

The probability of y = k or more is

$$P(y \ge k) = \sum_{y=k}^{n} \frac{n!}{y!\,(n - y)!}\, p^y q^{n-y}. \tag{1, 1, 4}$$

Tables of these two probabilities have been compiled, (10).
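In place of the tables, (1, 1, 3) and (1, 1, 4) can be evaluated directly for moderate n. The following sketch also recovers the mean (1, 1, 1) and variance (1, 1, 2) from the individual probabilities. The values n = 10 and p = .5 are arbitrary, and the function names are not from the text.

```python
from math import comb


def binomial_pmf(n, k, p):
    """Probability of exactly k occurrences in n independent trials, as in (1, 1, 3)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_tail(n, k, p):
    """Probability of k or more occurrences, as in (1, 1, 4)."""
    return sum(binomial_pmf(n, y, p) for y in range(k, n + 1))


if __name__ == "__main__":
    n, p = 10, 0.5
    print(binomial_pmf(n, 5, p))
    print(binomial_tail(n, 8, p))
    mean = sum(k * binomial_pmf(n, k, p) for k in range(n + 1))
    variance = sum((k - mean) ** 2 * binomial_pmf(n, k, p) for k in range(n + 1))
    print(mean, n * p)                 # compare with pn
    print(variance, n * p * (1 - p))   # compare with npq
```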
1, 1, 1 Example.

1, 2 The Multinomial Distribution.

If a stochastic event can take on c values with probabilities $p_1, p_2, \ldots, p_c$, then in n
trials the probability of getting a specific frequency count $f_1, f_2, \ldots, f_c$ is

$$P(f_1, f_2, \ldots, f_c) = \frac{n!\; p_1^{f_1} p_2^{f_2} \cdots p_c^{f_c}}{f_1!\, f_2! \cdots f_c!}. \tag{1, 2, 1}$$



The probability of getting this count or a less likely one is a less easily handled question, and is 
more important. The usual method of handling it is to transform the question into terms of 
another distribution, such as the binomial or the Poisson or the normal. 



1, 2, 1 Example of Multinomial Distribution. 

If a barrel is full of English newspapers cut into little pieces, one letter to each piece, and 
then pieces are drawn haphazardly after stirring, frequency counts of samples will be multino- 
mially distributed. 

Or take a simpler example. Suppose we have a 5-letter alphabet, A, E, I, O, U, in the ratio
60:30:20:15:12 in the barrel. The probabilities are then $\frac{60}{137}$, $\frac{30}{137}$, $\frac{20}{137}$, $\frac{15}{137}$, and $\frac{12}{137}$. If we
draw a sample of 46, the most probable frequency count is $f_A = 20$, $f_E = 10$, $f_I = 7$, $f_O = 5$,
$f_U = 4$. The probability of getting exactly that is

$P(20, 10, 7, 5, 4) = 46!\, \frac{\left(\frac{60}{137}\right)^{20} \left(\frac{30}{137}\right)^{10} \left(\frac{20}{137}\right)^{7} \left(\frac{15}{137}\right)^{5} \left(\frac{12}{137}\right)^{4}}{20!\; 10!\; 7!\; 5!\; 4!} = .001$.
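The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch of (1, 2, 1); the function name is an assumption of the sketch, and the counts and ratios are those of the five-vowel example.

```python
from math import factorial, prod


def multinomial_probability(counts, weights):
    """Probability of an exact frequency count, following (1, 2, 1).

    `weights` are the relative proportions in the universe; they are
    normalized here so that they need not sum to 1.
    """
    total_weight = sum(weights)
    n = sum(counts)
    coefficient = factorial(n)
    for f in counts:
        coefficient //= factorial(f)      # stays an exact integer at every step
    return coefficient * prod((w / total_weight) ** f
                              for w, f in zip(weights, counts))


# Ratio 60:30:20:15:12 for A, E, I, O, U and the count 20, 10, 7, 5, 4.
p = multinomial_probability([20, 10, 7, 5, 4], [60, 30, 20, 15, 12])
print(p)   # of the order of one in a thousand
```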



1, 3 The Poisson Distribution. 

If we are dealing with very large total counts n it is convenient to think in terms of the expected
value a. From (1, 1, 1) we have a = pn. Then (1, 1, 3) becomes, if we separate the
factors depending on n,

$P(y=K) = \frac{n!}{K!\,(n-K)!} \left(\frac{a}{n}\right)^{K} \left(1-\frac{a}{n}\right)^{n-K} = \frac{a^K}{K!} \left(1-\frac{a}{n}\right)^{n-K} \frac{n(n-1)(n-2) \cdots (n-K+1)}{n^K} \approx \frac{a^K}{K!}\, e^{-a}$.

This approximation is good if a/n = p is small and n is large. Tables of this "Poisson Distribution"
have been published, (13) and (14), along with the cumulative function,

$P(y \ge K) = \sum_{y=K}^{\infty} \frac{a^y}{y!}\, e^{-a}$.

Since the tables are independent of n they are short and easy to use.

1, 3, 1 Example of the Use of the Poisson Distribution.

In a list of 98,000 four digit groups the most frequent group occurs 29 times. Is this
extraordinary? Here p = .0001, and a = pn = .0001 x 98,000 = 9.8. Table II (13) says that
when 9.8 are expected, 29 or more will occur 1 time in a million, p = .000,001, when the
particular group is specified in advance. We must remember that this is the best in 10,000 tries, so
that in a hundred samples of this kind there would be a million opportunities for a group to be
frequent. Therefore this result would occur about once in a hundred such random experiments.
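The table lookup in this example can be imitated by summing the Poisson terms directly. The sketch below is an illustration only; the function name and the fixed number of summed terms are assumptions. It gives the same order of magnitude as the tabulated figure rather than reproducing the table entry exactly.

```python
from math import exp


def poisson_tail(k, a, terms=200):
    """P(y >= k) when a occurrences are expected, by summing a^y e^-a / y!."""
    term = exp(-a)
    for j in range(1, k + 1):
        term *= a / j                 # after the loop: term = a^k e^-a / k!
    total = 0.0
    for j in range(k, k + terms):     # the remaining terms are negligible here
        total += term
        term *= a / (j + 1)
    return total


per_group = poisson_tail(29, 9.8)     # one group specified in advance
any_group = 10_000 * per_group        # rough allowance for 10,000 possible groups
print(per_group, any_group)
```

Multiplying by the 10,000 possible groups follows the reasoning of the example: it is an upper estimate of the chance that some group reaches the count.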

1, 4 The Normal Distribution.

If $y_1, y_2, \ldots, y_n$ is a sequence of stochastic variables all with the same distribution, where
the mean is E(y) and the variance $\sigma^2(y)$, then the sigmage

(1, 4, 1)    $S = \frac{\sum_{i=1}^{n} y_i - n\,E(y)}{\sqrt{n\,\sigma^2(y)}}$

is a stochastic variable with mean E(S) = 0 and variance $\sigma^2(S) = 1$. If n is large, it can be
shown that

(1, 4, 2)    $P(S \ge x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt$.

This is the "normal" distribution with mean 0 and variance 1. The condition that a variable be 
a sum of a large number of similarly distributed quantities is frequently fulfilled, and conse- 
quently the normal distribution has wide application. For instance, Gauss assumed that errors 
in measurements were accumulations of smaller errors, and derived (1, 4, 2) for the distribution 
of errors. Sometimes this is called the "Gaussian error function," or the "bell shaped curve." 

The latter refers to the appearance of the graph of $y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$.

1, 4, 1 Example of the Use of the Normal Distribution.

In solving a cipher system a method of exhaustive trials has been devised, exactly one trial 
of which should give plain text. To help decide which trial was successful, each candidate for 
plain text is judged by a set of weights, the sum of which is a “score.” The expected score for 
random happens to be —10, with a standard deviation of 4.5. After 10,000 trials the best score 
found was 8; how good is this? This is 8 - (-10) = 18 above expected, or exactly 18/4.5 = 4
sigmas. Since the scores are sums of other stochastic variables, the normal distribution applies 
approximately, and 4 sigmas or better will occur about once in 33,000 trials from (16). Since we 
have made 10,000 trials already, we should get a score this good or better once in 3 such experi- 
ments from random material. This is a discouraging result for the cryptanalyst. 
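The figures in this example can be reproduced from (1, 4, 2) with the complementary error function of the Python standard library. The sketch is illustrative; the function and variable names are assumptions, and the numbers are those of the example.

```python
from math import erfc, sqrt


def normal_tail(x):
    """P(S >= x) for a normal variable with mean 0 and variance 1, as in (1, 4, 2)."""
    return 0.5 * erfc(x / sqrt(2.0))


expected, sigma = -10.0, 4.5          # score expected from random material
best_score, trials = 8.0, 10_000

sigmage = (best_score - expected) / sigma
p = normal_tail(sigmage)
print("sigmage                         :", sigmage)
print("one trial in about              :", round(1 / p))
print("expected such scores in a run   :", trials * p)
```

The last figure, roughly one third, is the "once in 3 such experiments" of the text.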

1, 5 The Chi-Squared Distribution. 

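Consider the distribution of the stochastic variable

(1, 5, 1)    $\chi^2 = y_1^2 + y_2^2 + \cdots + y_n^2$.

If the y's are themselves independently and normally distributed, with mean 0 and variance 1,
then it can be shown that

(1, 5, 2)    $P(\chi^2 \le x) = \frac{1}{2^{n/2}\,\Gamma(n/2)} \int_0^{x} t^{\frac{n}{2}-1}\, e^{-t/2}\, dt$,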

see (2), Section 18.1. The parameter n is the number of degrees of freedom. It is important 
in using formula (1, 5, 2) to have n as the number of independent summands in (1, 5, 1). Ta- 
bles are available of (1, 5, 2), for instance (12). This distribution is important because f and the 
I. C. are asymptotically distributed this way. For an heuristic treatment of Chi-square and re- 
lated statistics consult (30). 



1, 6 The Zipf Distribution. 

Mr. Zipf (28), in a theory he developed on the use of tools, predicted that if certain items
(such as words) were ranked in the order of their use, so that $f_i$ is the frequency of the i-th most
frequent, then $i \cdot f_i$ would be approximately constant. That is

(1, 6, 1)    $f_i = f_1/i$.



This is found to be reasonably accurate for codes, with the exception of the most and least 
frequent groups. Sometimes the assumption that the k most frequent groups have been con- 
cealed gives a better approximation. Then we have 

(1, 6, 2)    $(i+k)\, f_i = (1+k)\, f_1$,    or    $f_i = \frac{1+k}{i+k}\, f_1$.

This is a distribution in the mathematical sense, and one of considerable cryptologic interest.
If we write

(1, 6, 3)    $\sum_{i=1}^{c} 1/i = t$

then from (1, 6, 1),

(1, 6, 4)    $N = \sum_{i=1}^{c} f_i = \sum_{i=1}^{c} f_1/i = t f_1$.

Then the expected rank is

(1, 6, 5)    $E(i) = \sum_{i=1}^{c} i f_i / N = c f_1/N = c/t \approx c/\log_e c$,

since $t = \sum_{i=1}^{c} 1/i \approx \log_e c$.

The variance of the rank is

(1, 6, 6)    $\sigma^2(i) = \frac{(c+1)c}{2t} - \frac{c^2}{t^2}$

since

$E(i^2) = \frac{1}{N} \sum_{i=1}^{c} i^2 f_i = \frac{1}{N} \sum_{i=1}^{c} i f_1 = \frac{f_1}{N} \sum_{i=1}^{c} i = \frac{f_1}{t f_1} \cdot \frac{(c+1)c}{2} = \frac{(c+1)c}{2t}$.

All the moments can be computed in terms of the sums $\sum_{i=1}^{c} i^k$, which are evaluated in Section 405
of reference (8).
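The closed forms for the mean and variance of the rank can be compared with a direct summation over the ranks. The following Python sketch assumes the pure Zipf law $f_i = f_1/i$ for c groups; the function name and the choice of c = 1000 are illustrative assumptions.

```python
def zipf_rank_moments(c):
    """Mean and variance of the rank under f_i = f_1 / i, computed two ways."""
    t = sum(1.0 / i for i in range(1, c + 1))            # t as in (1, 6, 3)

    # Closed forms, as in (1, 6, 5) and (1, 6, 6).
    mean_closed = c / t
    variance_closed = (c + 1) * c / (2 * t) - (c / t) ** 2

    # Direct summation with relative frequencies f_i / N = 1 / (i t).
    probabilities = [1.0 / (i * t) for i in range(1, c + 1)]
    mean_direct = sum(i * p for i, p in enumerate(probabilities, start=1))
    second_moment = sum(i * i * p for i, p in enumerate(probabilities, start=1))
    variance_direct = second_moment - mean_direct ** 2

    return mean_closed, variance_closed, mean_direct, variance_direct


print(zipf_rank_moments(1000))
```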



1, 7 Approximate Distributions. 

The normal, Poisson, and chi-squared distributions provide approximations to the finite 
distributions we are usually interested in, approximations which get better as the sample size 
increases. That is, the sum of similarly distributed variables is asymptotically normal. 

There is a statistic measuring the “goodness of fit” of two counts, the "chi-squared,” defined 
as follows: if $f_i$ and $g_i$ are the components of the counts, with

$\sum_{i=1}^{c} f_i = N$,    $\sum_{i=1}^{c} g_i = M$,

then

(1, 7, 1)    $\chi^2 = \sum_{i=1}^{c} \frac{M f_i^2}{N g_i} - N$.



This asymmetrical version was derived from the viewpoint that the $g_i$ represented the universe,
and the $f_i$ a sample. There are other versions. This statistic is asymptotically chi-squared
distributed. One must distinguish carefully between the chi-squared statistic and the
chi-squared distribution.
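The chi-squared statistic of a sample count against a universe count is easily computed. The sketch below compares the asymmetrical form of (1, 7, 1) with the more familiar sum of (observed - expected)^2 / expected, to which it is algebraically equal when the expected counts are $N g_i / M$. The function names and the sample data are assumptions made for the illustration.

```python
def chi_squared_statistic(sample, universe):
    """Chi-squared in the form of (1, 7, 1): sum of M f_i^2 / (N g_i), minus N."""
    n = sum(sample)
    m = sum(universe)
    return sum(m * f * f / (n * g) for f, g in zip(sample, universe)) - n


def chi_squared_classical(sample, universe):
    """The same statistic written as sum of (observed - expected)^2 / expected."""
    n = sum(sample)
    m = sum(universe)
    total = 0.0
    for f, g in zip(sample, universe):
        expected = n * g / m
        total += (f - expected) ** 2 / expected
    return total


sample = [20, 10, 7, 5, 4]            # hypothetical sample count
universe = [60, 30, 20, 15, 12]       # hypothetical universe count
print(chi_squared_statistic(sample, universe),
      chi_squared_classical(sample, universe))
```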

The cryptanalytic statistics $\phi$, $\delta$, and $\gamma$ are all distributed asymptotically chi-squared.

That is, these distributions can be computed from that of $\chi^2$, given in Section 1, 5. The last
two of these have been tabulated, see (17).

The index of coincidence found by counting hits has a binomial distribution (17). 

The Cross I. C. f 

(1,7,2) 2 f* b 

has a distribution computable from the Incomplete Beta Function, see (17) for the tables. The
distribution is as follows. Let $I_x$ stand for the Incomplete Beta Function, which has
been tabulated by Pearson (26). If the I.C.'s of the component distributions are $F = \frac{c}{N^2} \sum f_i^2$
and $G = \frac{c}{M^2} \sum g_i^2$, then



(1, 7, 3) P(*2:l + x v'(F-l) (G— 1) - I i X (^l, |). 



The derivation of this is given by Gleason (27). This statistic is related to the correlation
coefficient.

Since these approximations are asymptotic, and since cryptanalytic work is frequently with
small samples, some experiments have been made (23) with the $\delta$ I. C. to test the accuracy of
our tables. With the smallest sample tried, c = 5 and N = 7, the least accurate estimate was off
by a factor of 10, where the probability of a sigmage of 6 was given as .057 by chi-square, while
in fact it occurred in only .0054 of the cases. In a less extreme case, c = 5, N = 15, the
probability of a sigmage of 11.5 is .00013, while a chi-squared estimate is .000025, too small by a factor
of 5. The number of counts per category, N/c, is a criterion for the accuracy of the chi-square as
an estimate for the distribution of the $\delta$ I. C. We see that with N/c = 3 it is off at most by only
a factor of 5, not too bad for most cryptanalytic applications.

For many statistical questions in cryptanalysis extremes ("tails") of a distribution are needed.
For instance, from Poisson it may be required to know approximately the probability of
getting 320 or more successes when 200 are expected. This is far beyond the tabulated values 
and the calculation is very laborious. There are special methods of getting good approximations 
for the extremes to some of these distributions. The standard methods for approximating near 
the mean are nearly useless. 
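With a computer the far tail of the Poisson distribution can be summed directly, provided the leading term is formed in logarithms so that the large powers and factorials do not overflow. The sketch below is one such approach, intended for the case c > a; it is not the Cramer and Gleason method, and the function name and tolerance are assumptions.

```python
from math import exp, lgamma, log


def poisson_upper_tail(c, a, rel_tol=1e-15):
    """P(c or more successes when a are expected), intended for c > a.

    The leading term a^c e^-a / c! is formed in logarithms; the later
    terms are accumulated as ratios to the leading term.
    """
    log_leading = c * log(a) - a - lgamma(c + 1)
    ratio_sum, ratio, k = 1.0, 1.0, c
    while True:
        k += 1
        ratio *= a / k                # term k divided by term c
        ratio_sum += ratio
        if ratio < rel_tol * ratio_sum:
            break
    return exp(log_leading) * ratio_sum


# The case mentioned in the text: 320 or more successes when 200 are expected.
print(poisson_upper_tail(320, 200.0))
```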

A method of getting as close an approximation as one pleases to the cumulative Poisson has
been given by Cramer and Gleason (21), and is presented below. The derivation uses a
continued fraction expansion.

The probability of c or more successes when a are expected is

$P(c,a) = \sum_{k=c}^{\infty} \frac{a^k e^{-a}}{k!}$.

This can be approximated in the following way.

$P(c,a) \approx \frac{a^c e^{-a}}{c!} \cdot \frac{A_i}{B_i}$

where $A_i$ and $B_i$ are defined recursively by

$A_i = A_{i-1} + (-1)^{i-1}\, a\, \alpha_i\, A_{i-2}$
$B_i = B_{i-1} + (-1)^{i-1}\, a\, \alpha_i\, B_{i-2}$

where for convenience $\alpha_i$ is defined as

$\alpha_{2m+1} = \frac{c+m-1}{(c+2m-1)(c+2m)}$,    $\alpha_{2m} = \frac{m}{(c+2m-2)(c+2m-1)}$.

This gives an iterative method of approximating P(c,a). Each approximation is better 
than the last, and some are overestimates and some underestimates, so that the true value is 
boxed in. 



1, 8 Regression. 

A statistic may seek to measure the interrelation between the components of the data. For 
instance the data may be the ages of men, each with that of his spouse. In such a case the data 
can be plotted on a graph, one point for each datum. This graph may have to have high
dimension, but theoretically this is no objection. The data will ordinarily give a cloud of points and
the shape of this cloud will be of interest. One may particularly look to see if it is elongated, and 
if so in what direction. If it is, then there will be an axis along which it is stretched, and this 
axis can be found. It is called the "line of regression". It is the line which is closer to all the 
points, in a sense, than any other line. More precisely, it is the line such that the sum of the 
squares of the distances of the data from it is a minimum. A method for finding this line will be
given in section 4, 6 after some vector techniques have been discussed. 



2. The Matching of Distributions.

2, 1 Goodness of Fit Defined.

If a universe U is known, the exact probability P(S) of a particular sample S of size N can
be computed by using the multinomial expression of Section 1, 2. Furthermore the possible
samples T of size N which give a lower probability, P(T) < P(S), can theoretically be
enumerated, since there are only $c^N$ samples possible. Let

$M = \sum_{P(T) < P(S)} P(T) + \frac{1}{2} \sum_{P(T) = P(S)} P(T)$.

Then Dawson (19) calls the relative number, $\frac{M}{c^N}$, the "goodness of fit (g.f.)" of S with U. To
paraphrase then, the goodness of fit of S with U is the probability that a random sample T will
give a number P(T) less than P(S). The "poorness of fit" can be defined as 1 - g.f. If U is the
flat universe with $p_i = \frac{1}{c}$ then the poorness of fit of S with U can be called its "roughness".

The advantage of this definition is that samples are ranked according to the probability with
which they would arise. For example, consider the two samples from a 5-letter alphabet,
$S_1$: 10,10,10,10,0 and $S_2$: 6,6,6,6,16. Here N = 40. The gamma I. C. of each is 5/4, but the
roughnesses are not the same, for $S_2$ will occur in drawings from a flat universe over 26 times as
often as $S_1$.
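The comparison of these two samples can be checked numerically. In a flat universe every sequence of 40 letters is equally likely, so the probability of a frequency count is proportional to the number of sequences that produce it. The Python sketch below takes the gamma I. C. to be $c \sum f_i^2 / N^2$, which is an assumption of the sketch, though it agrees with the value 5/4 quoted for both samples; the function names are also assumptions.

```python
from math import factorial


def gamma_ic(counts):
    """Gamma I. C., taken here to be c * sum(f_i^2) / N^2."""
    c, n = len(counts), sum(counts)
    return c * sum(f * f for f in counts) / (n * n)


def arrangements(counts):
    """Number of distinct letter sequences having this frequency count."""
    total = factorial(sum(counts))
    for f in counts:
        total //= factorial(f)        # exact integer at every step
    return total


s1 = [10, 10, 10, 10, 0]
s2 = [6, 6, 6, 6, 16]
ratio = arrangements(s2) / arrangements(s1)
print(gamma_ic(s1), gamma_ic(s2), ratio)
```

The two counts have the same I. C. but very different probabilities of arising, which is the point of ranking samples by probability rather than by I. C.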

In (19) it is shown that the distribution of the goodness of fit is closely determined by the
distribution of

(2, 1, 1)    $s = \sum_{i=1}^{c} (f_i + 1/2) \log \frac{f_i}{N p_i}$.

Sometimes s itself is used as a measure of roughness. The same source (19) shows that 2s is
distributed asymptotically chi-squared. That is,

(2, 1, 2)    $P(s \le x) = \frac{1}{2^{n/2}\,\Gamma(n/2)} \int_0^{2x} t^{\frac{n}{2}-1}\, e^{-t/2}\, dt$,

which has been tabulated for the $\chi^2$ statistic.

2, 2 Goodness of Fit of Two Samples.

The probability of drawing 2 samples $S_1$ and $S_2$ of size $N_1$ and $N_2$ respectively from a
universe at random is the same as that of drawing a single sample S of size $N = N_1 + N_2$ and
then separating S at random, getting $S_1$ and $S_2$. If the universe is unknown the likelihood of
the 2 samples arising from the same source is measured by the probability of such a split. By
neglecting constant factors the function

(2, 2, 1)    $F = \prod_{i=1}^{c} f_i!\, g_i!$

is arrived at as a measure of the poorness of fit of $S_1 : \{f_i\}$ and $S_2 : \{g_i\}$.

See (19), Section 10, for details. The distribution of this is not so well known.

WEIGHTING TECHNIQUES

2, 3 Bayes’ Theorem. 

If $H_1$ and $H_2$ are two hypotheses, and if E is an event, and if $P(E/H_i)$ is the probability
of E if $H_i$ pertains, then the probability of E and $H_i$ is

(2, 3, 1)    $P(E,H_i) = P(H_i)\,P(E/H_i)$.

This can be rewritten to

(2, 3, 2)    $P(H_i)\,P(E/H_i) = P(H_i,E) = P(E)\,P(H_i/E)$

which can be solved for

$P(H_i/E) = \frac{P(H_i)\,P(E/H_i)}{P(E)}$.

The ratio of these for i = 1,2 gives

Bayes' Theorem Stated:    $\frac{P(H_1/E)}{P(H_2/E)} = \frac{P(H_1)}{P(H_2)} \cdot \frac{P(E/H_1)}{P(E/H_2)}$

The odds on the hypothesis having observed the event is the odds preceding the event
multiplied by a factor.

This factor is $\frac{P(E/H_1)}{P(E/H_2)}$, the "Bayes Factor".

This says that having observed an event E we can draw some conclusions about the possible
causes of this event. The possible causes are designated $H_1, H_2, \ldots$. The probability of
$H_i$ is written $P(H_i)$, and the probability of $H_i$ after having observed the event E is written
$P(H_i/E)$. The probability of the event E if $H_i$ is in fact the cause is written as $P(E/H_i)$. The
theorem concerns the ratio of probabilities, or odds. $\frac{P(H_1)}{P(H_2)}$ is the odds in favor of $H_1$ against $H_2$.
$\frac{P(H_1/E)}{P(H_2/E)}$ is the same odds after the event E has occurred. $\frac{P(E/H_1)}{P(E/H_2)}$ is not an odds, since the
events are not alternatives, but is called "the Bayes factor". This theorem gives an objective
way of considering circumstantial evidence.
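The theorem amounts to one multiplication: odds after the event equal odds before the event times the Bayes factor. The Python sketch below uses exact fractions. The function names and the numbers (prior odds of 1 to 99, and an event one hundred times as likely under $H_1$ as under $H_2$) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def posterior_odds(prior_odds, p_event_given_h1, p_event_given_h2):
    """Odds on H1 against H2 after observing E: prior odds times the Bayes factor."""
    bayes_factor = Fraction(p_event_given_h1) / Fraction(p_event_given_h2)
    return Fraction(prior_odds) * bayes_factor


def odds_to_probability(odds):
    """Convert odds in favor of H1 into P(H1), when H1 and H2 are the only alternatives."""
    return odds / (1 + odds)


prior = Fraction(1, 99)                                   # hypothetical prior odds
post = posterior_odds(prior, Fraction(1, 10), Fraction(1, 1000))
print(post, odds_to_probability(post))
```

When several independent events are observed the factors multiply, which is why in practice the logarithms of the factors are added as weights.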



2, 4 An Example of Application.

3. Unprejudiced Estimates of Universes. 

3, 1 The Law of Succession. 

To decide whether a frequency count comes from one of two universes one should know the 
distributions in those universes. Usually only samples from those universes are available. For 
example, suppose we are to devise a test for English newspaper text. We count 200 letters of 
newspaper text, and use this count as an estimate of the frequency distribution. But in this 
count the frequency of Q is zero. If we accept this as a fact, then no sample containing a Q can 
possibly satisfy our test for newspaper text. The sample has given us a “prejudiced" picture of 
the universe. If we add 1 to each count and use the result as an estimate, then no letter will be 
impossible. This procedure can be justified rigorously. If $\{f_i\}$ is a frequency count of such a
sample, then $\{1+f_i\}$ is an unprejudiced estimate of the universe, under the hypothesis that a
priori all distributions are equally likely.
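The add-one procedure is simple to carry out. The following Python sketch forms the counts $1+f_i$ for a 26-letter alphabet and, as an assumption of the sketch, divides by N + c so that the estimates sum to 1. The function name and the sample text are hypothetical.

```python
from collections import Counter
import string


def add_one_estimate(sample_text, alphabet=string.ascii_uppercase):
    """Estimate letter probabilities from the counts 1 + f_i."""
    counts = Counter(ch for ch in sample_text.upper() if ch in alphabet)
    total = sum(counts.values()) + len(alphabet)      # N + c
    return {ch: (counts[ch] + 1) / total for ch in alphabet}


estimate = add_one_estimate("THE ENEMY WILL ATTACK AT DAWN")
print(estimate["Q"], estimate["T"])   # Q never occurred, yet is not impossible
```

Because no estimate is zero, logarithmic weights built from these probabilities remain finite.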

The derivation is as follows: Given a process producing a c-letter alphabet, suppose a sample
S of N letters has been drawn, getting $f_i$ cases of the ith letter,

$\sum_{i=1}^{c} f_i = N$.

If one more letter is drawn, what are the odds on its being the ith letter $x_i$? Suppose that before
looking at the frequency count $\{f_i\}$ the odds are even on all letters. Then after looking at the
counts the odds are $f_1+1 : f_2+1 : \ldots : f_c+1$. A more vivid picture is as follows. Suppose
that initially there are $c^{N+1}$ hats, each with N+1 letters in it, and each with a different
frequency count. All possible frequency counts of N+1 letters are available in the hats. Now
a hat is selected and N letters are drawn from it. What are the odds on the next letter? The
odds are $1+f_1 : 1+f_2 : 1+f_3 : \ldots$. For the probability of drawing our sample S and then
drawing $x_i$ is the same as that of drawing a sample $S+x_i$ and then drawing $x_i$ from the sample,
that is, $P(S+x_i)\, P(x_i/S+x_i)$. We assume that $P(S+x_i) = P(S+x_j)$. Then the odds on $x_i$
to $x_j$ are

(3, 1)    $P(x_i/S+x_i) : P(x_j/S+x_j) = \frac{1+f_i}{N+1} : \frac{1+f_j}{N+1}$.

Thus the odds on the various letters are $1+f_1 : 1+f_2 : 1+f_3 : \ldots$.

The use of this estimate of the universe becomes very important if some $f_i = 0$. For in
that case unreasonable log weights of $\infty$, or even $\infty - \infty$, may arise if the estimate $f_i$ is used.

See (29) for a longer discussion of the conditions under which the modified counts may be used.

3, 2 Code Groups.

4. Some Matrix Definitions and Properties.

A matrix is a rectangular array of numbers. Matrices are recurring items in the statistics 
of cryptanalysis, as will be seen in section 5. 

Three ways of indicating a 3 x 2 matrix (the number of rows is always given first) are

(4, 1)    $\begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \\ a_{20} & a_{21} \end{pmatrix} = (a_{ij}) = A$.



4, 1 Elementary Properties of Matrices. 

Two matrices of the same size can be added by adding corresponding elements, 

(4, 1, 1)    $A + B = (a_{ij}) + (b_{ij}) = (a_{ij}+b_{ij})$

$\begin{pmatrix} 3 & 2 \\ 0 & -1 \\ 1 & -1 \end{pmatrix} + \begin{pmatrix} -2 & 0 \\ 1 & 2 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 1 & 1 \\ 2 & -2 \end{pmatrix}$

In talking about matrices as entities like numbers we need a word to describe numbers
themselves. We call them "scalars". Scalar multiples of a matrix are defined by

(4, 1, 2)    $mA = (m\, a_{ij})$,

that is, multiply each element by the number m. For example,

$3 \begin{pmatrix} 1/3 & 2 \\ -1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 6 \\ -3 & 0 \end{pmatrix}$.

The scalar product on the right is defined the same way,

$A\,m = m\,A$.

The matrix of which each element is zero is represented by 0.

A combination m A + n B of two matrices A and B is called a "linear" combination. This
can be extended to a linear combination of any number of matrices $\sum_{i=1}^{c} m_i A_i$. If some linear
combination of a set of matrices, with coefficients not all zero, is zero, $\sum_{i=1}^{c} m_i A_i = 0$, then the
set is said to be "linearly dependent".
If every linear combination (except that with 0 coefficients) of a set is different from zero the set
is said to be "linearly independent".

Examples: The 1x2 matrices (1, 3) and (2, 6) are linearly dependent. The 1x3 matrices
(1, 1, 5), (-2, 1, -1), and (-1, 2, 6) are linearly independent.

The product of two matrices can be defined if the number of columns of the first is the same
as the number of rows of the second. It is defined thus: if A is a k x m matrix, and B an m x n,
then

(4, 1, 3)    $AB = \left( \sum_{r=0}^{m-1} a_{ir}\, b_{rj} \right)$

and is a k x n matrix. Each row of A is multiplied by each column of B and summed to give
an element of the product. In general $BA \ne AB$.

$\begin{pmatrix} 3 & 2 \\ 0 & -1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 2 \\ 2 & 1 & -2 \end{pmatrix} = \begin{pmatrix} 7 & 5 & 2 \\ -2 & -1 & 2 \\ -1 & 0 & 4 \end{pmatrix}$

$\begin{pmatrix} 1 & 1 & 2 \\ 2 & 1 & -2 \end{pmatrix} \begin{pmatrix} 3 & 2 \\ 0 & -1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 5 & -1 \\ 4 & 5 \end{pmatrix}$.





It can be shown by straight-forward computation that 

A(B+C) = AB+AC and (B+C)A = BA+CA. Also A(BC) = (AB)C. 
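The definition (4, 1, 3) and the distributive law can be tried out with a few lines of Python operating on lists of rows. This is a sketch for illustration; the function names are assumptions, the matrices A and B are chosen to match the worked products of this section, and C is an arbitrary matrix introduced here.

```python
def matmul(a, b):
    """Product of a k x m matrix and an m x n matrix, following (4, 1, 3)."""
    if len(a[0]) != len(b):
        raise ValueError("columns of the first must equal rows of the second")
    return [[sum(a[i][r] * b[r][j] for r in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(a, b):
    """Sum of two matrices of the same size, element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


A = [[3, 2], [0, -1], [1, -1]]        # 3 x 2
B = [[1, 1, 2], [2, 1, -2]]           # 2 x 3
C = [[0, 1, 0], [1, 0, 1]]            # arbitrary 2 x 3 matrix for this sketch

print(matmul(A, B))                   # a 3 x 3 matrix
print(matmul(B, A))                   # a 2 x 2 matrix, so BA differs from AB
print(matmul(A, matadd(B, C)) == matadd(matmul(A, B), matmul(A, C)))
```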

The square matrix $I = (i_{ij})$, where $i_{ii} = 1$ and $i_{ij} = 0$ for $i \ne j$, is called the "identity".
The "identity" has 1's on the principal diagonal and 0's elsewhere. It has the property that
IB = B = BI for each matrix B, when these products are defined. The matrix O with every
element 0 is called "zero".

A matrix with only one column, m x 1, or with only one row, 1 x n, is called a “vector”. 
Vectors are especially important in the theory of matrices. If v is a column vector, m x 1, and 
if M is a k x m matrix, then Mv = y is a k x 1 column vector. If M is square m x m then Mv 
has the same dimensions as v. 

4, 2 Determinants. 

A “determinant” is a number derived from a square matrix by applying certain rules of 
manipulation. These rules are; form every product possible by selecting exactly one element 
from each row and column, and add all these products with proper signs. The signs to be used 
are determined by a rule which may seem complex when first encountered. Half of the signs 
are positive and half negative. If the elements in a product are arranged in the order of the 
columns from which they come, then the rows from which they come are in a permuted order. 
If this permutation is odd the sign is negative. If it is even the sign is positive. An explanation
of when a permutation is odd is in order here. A permutation or rearrangement can be
accomplished (sometimes in several ways) by successively interchanging pairs of elements. If the
number of interchanges is odd the permutation is odd. See reference (11) for more details. If
A is a square matrix, the determinant with the same elements will be written as |A|. For
example, |I| = 1, and |O| = 0. It can be shown that |AB| = |A| . |B|.

In general $|A+B| \ne |A|+|B|$. For instance, if $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$,
then $A+B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix} = 1$, while |A| = 0 = |B|.



It can be shown that a determinant vanishes if and only if its columns are linearly dependent.
That is, if $|a_{ij}|$ is the c x c determinant, $|a_{ij}| = 0$ if and only if there exist c numbers, $v_j$, not
all zero,

(4, 2, 1)    such that $\sum_{j=0}^{c-1} a_{ij} v_j = 0$ for each i.



Examples of the application of this last property are as follows: 




= 0, since $v_0 = 1$, $v_1 = -1$ will satisfy (4, 2, 1);



Also 




= 0, since $v_0 = y$, $v_1 = -x$ will satisfy (4, 2, 1);



2 3 2' 

-6 12 

3 +2 — 4 J 



= 0 . 



The coefficients $v_0$, $v_1$, $v_2$ can be determined by the student as an exercise.

If $A = (a_{ij})$ is an n x n matrix then $(m a_{ij})$ is also a matrix and $|m a_{ij}| = m^n |a_{ij}|$. For in
evaluating the determinant each product has n factors of m, giving each term a factor of $m^n$.

If each element of a particular column of A is multiplied by m then the determinant is multi- 
plied by m. For each product will have one factor of m, so the sum has a factor of m. Similarly, 
if each element of a row is multiplied by m, then the determinant is multiplied by m. 

If two columns of a determinant are interchanged then the sign of the determinant is changed. 
To see this look back at the rule for determining the signs. The interchange of two columns 
puts an additional interchange into each permutation, thus changing odd to even and even to 
odd. Thus all the signs are changed, changing the sign of the sum. The same argument holds 
for the interchange of two rows. 

A logical consequence of this property is that a determinant with two identical columns must 
be zero. For the interchange of those two columns does not change anything, and yet requires 
that the sign of the determinant be changed. This can only happen if it is zero. The presence 
of two identical rows also implies that the determinant is zero. Consequently if one column 
(or row) is a multiple of another the determinant is zero. 

If the elements of one column of a determinant are each the sum of two numbers then the 
determinant can be expressed as the sum of two determinants. That is, 



$\begin{vmatrix} a_1 & b_1+x_1 & c_1 \\ a_2 & b_2+x_2 & c_2 \\ a_3 & b_3+x_3 & c_3 \end{vmatrix} = \begin{vmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \\ a_3 & b_3 & c_3 \end{vmatrix} + \begin{vmatrix} a_1 & x_1 & c_1 \\ a_2 & x_2 & c_2 \\ a_3 & x_3 & c_3 \end{vmatrix}$.

For in the expansion of the first by the definition each term would have a binomial factor which
means that the term can be written as the sum of two terms, thus

$a_i (b_j + x_j) c_k = a_i b_j c_k + a_i x_j c_k$.

If we collect the first of each of these pairs of terms we get a determinant, and those left form a 
determinant. The analogous fact holds for rows. 

Given a determinant, it can be modified by adding a multiple of any column to any other 
column without changing its value. This follows since the resultant determinant can be con- 
sidered to be the sum of two determinants, one of which is the original and the other is zero. The 
same holds for rows. This theorem is very useful in calculating the value of determinants, 
using the result of the next paragraph. The direct computation from the definition is not ordi- 
narily useful, since n! terms are involved, each a product of n factors. With n larger than 4 
this is a formidable amount of work. 



If a determinant has nothing but zeros above the principal diagonal it is said to be in
"triangular" form. For example

$\begin{vmatrix} 1 & 0 & 0 \\ 2 & 5 & 0 \\ 9 & 7 & 2 \end{vmatrix}$.

The value of a triangular determinant is the product of its diagonal elements. For in the
expansion by the definition each term is zero except
one. Those terms using any element of the first row except the first element must be zero. Those
using the first term of the first row must use the second term of the second row or be zero, and so
forth.



By adding multiples of columns to other columns, or of rows to other rows, it is possible to 
reduce the determinant to an equal triangular determinant which is easy to evaluate. 
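The reduction to triangular form is easily mechanized. The Python sketch below adds multiples of rows to other rows, interchanging two rows (and so changing the sign) only when a zero stands on the diagonal. It uses exact fractions and reduces to zeros below the diagonal rather than above it, which serves equally well. The function name is an assumption, and the 4 x 4 test matrix is the one whose determinant this section states to be zero.

```python
from fractions import Fraction


def determinant(rows):
    """Determinant by reduction to triangular form with exact arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n = len(m)
    sign = 1
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)                 # a column of zeros: determinant is zero
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            sign = -sign                       # an interchange changes the sign
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            # Adding a multiple of one row to another leaves the value unchanged.
            m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    result = Fraction(sign)
    for i in range(n):
        result *= m[i][i]                      # product of the diagonal elements
    return result


print(determinant([[1, 0, 0], [2, 5, 0], [9, 7, 2]]))
print(determinant([[-15, 13, 3, -9], [6, 8, -6, 7], [3, 9, 9, 5], [7, 8, 5, 8]]))
```

A numerical check of this kind does not replace the exercise, which asks for the linear dependence to be exhibited.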



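Exercise:

$\begin{vmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{vmatrix}$

Exercise: Show that this determinant is zero.

$\begin{vmatrix} -15 & 13 & 3 & -9 \\ 6 & 8 & -6 & 7 \\ 3 & 9 & 9 & 5 \\ 7 & 8 & 5 & 8 \end{vmatrix}$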

4, 3 Inverses and Conjugate Transposes of Matrices. 



If AB = I = BA, the matrix B is called the "inverse" of A, and is written $B = A^{-1}$. The
matrices A and B must be square and the same size. Some matrices have no inverse; these are
called "singular". For instance,

$\begin{pmatrix} 0 & 1/2 \\ 3 & -3 \end{pmatrix}^{-1} = \begin{pmatrix} 2 & 1/3 \\ 2 & 0 \end{pmatrix}$    and    $\begin{pmatrix} -1 & 2 \\ 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} -1 & 2 \\ 0 & 1 \end{pmatrix}$.

The matrix $S = \begin{pmatrix} 1 & -1 \\ -2 & 2 \end{pmatrix}$ has no inverse. For $S \begin{pmatrix} x & y \\ z & w \end{pmatrix} = \begin{pmatrix} x-z & y-w \\ -2(x-z) & -2(y-w) \end{pmatrix}$,
and obviously x - z = 1 and -2(x - z) = 0 are not both possible.

The matrix $(a_{ij})^* = (\bar{a}_{ji})$ is called the "conjugate transpose" of $(a_{ij})$. It is the result of
interchanging rows with columns and taking the complex conjugate of each element. The
conjugate transpose has several obvious properties.



(A+B)* = A*+B*, and (AB)* = B*A*. Also (A*)* = A. 



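$\begin{pmatrix} -1 & 2 \\ 0 & 1 \end{pmatrix}^* = \begin{pmatrix} -1 & 0 \\ 2 & 1 \end{pmatrix}$,    $\begin{pmatrix} 3 & i \\ 1+i & 0 \\ 1 & 1-i \end{pmatrix}^* = \begin{pmatrix} 3 & 1-i & 1 \\ -i & 0 & 1+i \end{pmatrix}$.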



If A is any matrix, then S = A*A has the special property that S* = S. Any matrix with this 
special property is called "Hermitian”. 



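$\begin{pmatrix} 3 & i \\ 1+i & 0 \\ 1 & 1-i \end{pmatrix} \begin{pmatrix} 3 & 1-i & 1 \\ -i & 0 & 1+i \end{pmatrix} = \begin{pmatrix} 10 & 3-3i & 2+i \\ 3+3i & 2 & 1+i \\ 2-i & 1-i & 3 \end{pmatrix}$

$\begin{pmatrix} 3 & 1-i & 1 \\ -i & 0 & 1+i \end{pmatrix} \begin{pmatrix} 3 & i \\ 1+i & 0 \\ 1 & 1-i \end{pmatrix} = \begin{pmatrix} 12 & 1+2i \\ 1-2i & 3 \end{pmatrix}$

Looking at the determinant $|A^*|$, it is clear that transposing does not affect it, while taking
the conjugate of each element causes the determinant to be conjugated. Thus $|A^*| = \overline{|A|}$.
For S above $|S| = |S^*| = \overline{|S|}$ whence the determinant is real.

$|A^*A| = \overline{|A|} \cdot |A| \ge 0$.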



The inverse also has several simple properties. For one, $(AB)^{-1} = B^{-1} A^{-1}$. Another
simple rule is $(A^*)^{-1} = (A^{-1})^*$. The inverse has no simple rule for sums, and in general
$(A+B)^{-1} \ne A^{-1} + B^{-1}$. If A is such that $A^* = A^{-1}$, that is, $A^*A = I$, it is called "orthogonal." The
matrices 



’3/5 -4/5 
,4/5 3/5 




i V2\ 
-V2 i J 



are each orthogonal. If A has an inverse, $A^{-1}$, then $|A^{-1}| = 1/|A|$. For
$|A| \cdot |A^{-1}| = |AA^{-1}| = |I| = 1$. This shows that if |A| = 0 then A has no inverse. It can be shown
that if a square matrix A has no inverse then |A| = 0. Thus A is singular if and only if
|A| = 0.
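For 2 x 2 matrices the connection between the determinant and the inverse can be seen directly. The Python sketch below uses the familiar closed form for the inverse of a 2 x 2 matrix, which is not derived in the text, and reports a singular matrix when the determinant is zero. The function name is an assumption; the matrices are those of the examples in this section.

```python
from fractions import Fraction


def inverse_2x2(matrix):
    """Inverse of a 2 x 2 matrix, or None if the determinant is zero (singular)."""
    (a, b), (c, d) = [[Fraction(x) for x in row] for row in matrix]
    det = a * d - b * c
    if det == 0:
        return None
    return [[d / det, -b / det], [-c / det, a / det]]


print(inverse_2x2([[0, Fraction(1, 2)], [3, -3]]))
print(inverse_2x2([[-1, 2], [0, 1]]))     # this matrix is its own inverse
print(inverse_2x2([[1, -1], [-2, 2]]))    # the singular matrix S: no inverse
```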



4, 4 Vectors. 

An important special case of a matrix is the vector, which is n x 1 or 1 x n. That is. 



f 2" 



and (1, i, -1) are vectors. We will usually mean the vertical version. The product of a matrix
by a vector, if defined, is another vector, Mv = y. If $M = \begin{pmatrix} 3 & 2 \\ 0 & -1 \\ 1 & -1 \end{pmatrix}$ and $v = \begin{pmatrix} a \\ b \end{pmatrix}$,
then $Mv = \begin{pmatrix} 3a+2b \\ -b \\ a-b \end{pmatrix}$. If Mu = x and Mv = y, then M takes a linear combination of u and v into the
same combination of x and y.

(4, 4, 1)    M(au+bv) = Mau+Mbv = aMu+bMv = ax+by.

Any transformation $u \to x$ and $v \to y$ which has the property that

$au+bv \to ax+by$



is called a “linear” transformation. Any linear transformation can be represented by matrices. 

It sometimes happens that for a square matrix A and vector v we have $Av = \lambda v$, where $\lambda$
is a scalar. That is, A effectively only "stretches" v. The vector v is called an "eigenvector"
of A, and $\lambda$ is called the corresponding "eigenvalue."

If $A = \begin{pmatrix} 3 & 2 \\ 0 & -1 \end{pmatrix}$ then $v = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$ and $\lambda = -1$ are an eigenvector and
associated eigenvalue of A. Another pair for A is $v = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, $\lambda = 3$.



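A multiple of an eigenvector is also an eigenvector with the same eigenvalue. $A \cdot mv = mAv = m\lambda v = \lambda \cdot mv$.
If two eigenvectors of a matrix have the same eigenvalue, then any linear
combination of the two is also an eigenvector. If $Au = \lambda u$ and $Av = \lambda v$, then
$A(mu+nv) = mAu+nAv = m\lambda u+n\lambda v = \lambda(mu+nv)$. For instance, $A = \begin{pmatrix} 2 & 0 & -1 \\ 0 & 2 & 1 \\ 0 & 0 & 3 \end{pmatrix}$
has the eigenvectors $u = \begin{pmatrix} 3 \\ 2 \\ 0 \end{pmatrix}$, $v = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$, with eigenvalue $\lambda = 2$. Then
$\begin{pmatrix} 3 \\ 2 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 3 \\ 3 \\ 0 \end{pmatrix}$ is an eigenvector with eigenvalue 2,
and $\begin{pmatrix} 3m \\ 2m+n \\ 0 \end{pmatrix}$ is also.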



If A is a matrix and if there exists a vector $v \ne 0$ such that Av = 0 then A is singular. For
the assumption that $A^{-1}$ exists leads to the impossible equation,

$v = Iv = (A^{-1}A)v = A^{-1}(Av) = A^{-1}0 = 0$.

The converse can be proved, but not here. The theorem can then be stated:

A matrix A annihilates some vector if and only if A is singular.

For if Av = 0, where $v \ne 0$, then $\sum_{j=0}^{c-1} a_{ij} v_j = 0$, and conversely. This is a necessary and
sufficient condition that |A| = 0, as was stated at the end of 4, 2. For instance, $A = \begin{pmatrix} 1 & -2 \\ -1 & 2 \end{pmatrix}$
annihilates $v = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$. We can now state: A matrix is singular if and only if its determinant is
zero.

Now if $Av = \lambda v$, then

$Av - \lambda v = 0$

and $(A - \lambda I)v = 0$.

Therefore $|A - \lambda I| = 0$.

Conversely, if $|A - \lambda I| = 0$,

then there exists a vector v such that $Av = \lambda v$, by the last statement of section 4, 2.



CONFIDENTIAL 



38 



ORIGINAL 



CONFIDENTIAL 



REF ID:A64688 



Thus a necessary and sufficient condition for X to satisfy a relation Av = Xv (for some v) is that 
X satisfy the equation | A — XI | =0. 

The expression $|A - \lambda I|$ can be expanded by the rules of determinants, and gives a polynomial in $\lambda$ of degree c, if A is a c x c matrix. For example, the eigenvalues of $A = \begin{pmatrix} 1 & -2 \\ -1 & 2 \end{pmatrix}$ are found by solving

$\begin{vmatrix} 1-\lambda & -2 \\ -1 & 2-\lambda \end{vmatrix} = \lambda^2 - 3\lambda = 0$.

They are $\lambda = 0$ and $\lambda = 3$. By the fundamental theorem of algebra, the equation $|A - \lambda I| = 0$ has at least one root $\lambda$, and consequently A has at least one eigenvector. In general A has c eigenvalues, not necessarily all distinct. The matrix $\begin{pmatrix} 3 & -1 \\ 1 & 1 \end{pmatrix}$ has only the one eigenvalue $\lambda = 2$, instead of two as might be expected. The associated eigenvector is $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$. For each eigenvalue distinct from the others, the matrix must have an eigenvector linearly independent of the remaining. If A has c different eigenvalues then
it has c independent eigenvectors.
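As an illustration only, the 2 x 2 case of the equation $|A - \lambda I| = 0$ can be solved with the quadratic formula. The sketch below is not part of the original text; the function names `eigenvalues_2x2` and `apply` are hypothetical, and it handles 2 x 2 matrices only.

```python
import cmath

def eigenvalues_2x2(a):
    """Roots of |A - lambda*I| = lambda^2 - trace*lambda + det = 0."""
    (p, q), (r, s) = a
    trace = p + s
    det = p * s - q * r
    root = cmath.sqrt(trace * trace - 4 * det)
    return ((trace - root) / 2, (trace + root) / 2)

def apply(a, v):
    """Matrix-vector product A v for a 2 x 2 matrix."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]

A = [[1, -2], [-1, 2]]   # characteristic polynomial lambda^2 - 3*lambda
B = [[3, -1], [1, 1]]    # repeated root lambda = 2

print(eigenvalues_2x2(A))   # roots 0 and 3 (returned as complex numbers)
print(eigenvalues_2x2(B))   # the double root 2
print(apply(B, [1, 1]))     # B stretches (1, 1) by the factor 2
```

For the second matrix both roots coincide, which is the situation described above in which fewer than c distinct eigenvalues appear.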



For any vectors u and v of the same dimensions the product $u^*v$ is well-defined, and is a
1 x 1 matrix, not ordinarily distinguished from a scalar. It is frequently called the "dot product"
of the vectors, $u \cdot v$, and for the 3-dimensional case is the product of the lengths times the cosine
of the angle between them.

For example, if $u = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}$ and $v = \begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix}$, then $u^*v = (1, 0, -1)\begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix} = -1$. Or if $x = \begin{pmatrix} 1 \\ 3 \\ 1 \end{pmatrix}$, and $v = \begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix}$, $x^*v = 0$. If $u^*v = 0$, the vectors u and v are said to be "orthogonal" or at right angles to each
other. The quantity $\sqrt{v^*v}$ is called the "length" of the vector.



If x and y are vectors orthogonal to a given vector v, then any linear combination

(4, 4, 2)    $ax + by$

is also orthogonal to v. In the example of the previous paragraph, x is orthogonal to v. The vector $y = \begin{pmatrix} -1 \\ 1 \\ 1 \end{pmatrix}$ is also orthogonal to v. Therefore $ax + by = \begin{pmatrix} a-b \\ 3a+b \\ a+b \end{pmatrix}$ is also orthogonal to v.



The collection of all linear combinations (4, 4, 2) is a "plane" perpendicular to v. The collection of
linear combinations of more than 3 vectors is called a "hyperplane" perpendicular to v. The collection
of all vectors perpendicular to v is "the" hyperplane perpendicular to v. An application of these
concepts is found in section 5, 3.

Problem: If u is a vector then uu* is a matrix. What are its eigenvectors, and their associated 
eigenvalues? 



Problem: If M is a matrix with the eigenvectors $v_k$, $Mv_k = v_k\lambda_k$, what are those of the matrix
$I - M$?

Problem: Find the square of $I - uu^*$.

Problem: If M has eigenvectors $v_k$, what are the eigenvectors of $M^*$? What are the associated
eigenvalues?

Problem: If M has eigenvectors $v_k$, and if f(x) is a polynomial, then f(M) is a matrix. What
are its eigenvectors and eigenvalues?

Problem: If u and v are two vectors, what is $u^*v - v^*u$?
4, 5 Geometry.

The equation $v = u\alpha$, where $\alpha$ is a scalar parameter, is a parametric equation for a
straight line through the origin having direction u. The equation $v = u\alpha + w$ represents a line
through w parallel to u. It will be perfectly general and also convenient to take u of length 1,
$u^*u = 1$.

If r and s are two vectors and d the distance between them then $r - s$ is the vector from
s to r, and $d^2 = (r-s)^*(r-s) = r^*r - 2r^*s + s^*s$.

The distance from the line $v = u\alpha$ to the point (vector) r can be found by first finding
the vector s through r and perpendicular to the line, and then finding its length. Then
$s = r - u\alpha$ for some $\alpha$ and $u^*s = 0$. To determine $\alpha$ take $0 = u^*s = u^*(r - u\alpha) = u^*r - u^*u\alpha$, whence $\alpha = u^*r$, using $u^*u = 1$. Now the length of s is

$\sqrt{s^*s} = \sqrt{(r - uu^*r)^*(r - uu^*r)} = \sqrt{r^*(I - uu^*)^*(I - uu^*)r} = \sqrt{r^*(I - uu^*)r}$.

The line L through the point w with the direction of the vector u is $v = w + u\alpha$, where
$\alpha$ takes on any scalar value. The distance d from L to r can be found by making
the transformation $v' = v - w$, whence L becomes $v' = u\alpha$ and $r' = r - w$. Then
$d^2 = (r-w)^*(I - uu^*)(r-w)$.
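The distance formula can be tried numerically. The following is a minimal sketch, not taken from the original text; `dot` and `distance_to_line` are hypothetical names, and the direction vector is normalised inside the function so that $u^*u = 1$ holds as the derivation assumes.

```python
import math

def dot(a, b):
    """Scalar product a*b of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def distance_to_line(r, w, u):
    """Distance from the point r to the line v = w + u*alpha."""
    norm = math.sqrt(dot(u, u))
    u = [x / norm for x in u]                 # enforce u*u = 1
    p = [ri - wi for ri, wi in zip(r, w)]     # r' = r - w
    # d^2 = p*(I - uu*)p = p*p - (u*p)^2
    d2 = dot(p, p) - dot(u, p) ** 2
    return math.sqrt(max(d2, 0.0))            # guard against rounding below 0

# Line through (1, 0, 0) along the third axis; the point (4, 4, 7)
# is 3 units away in the first coordinate and 4 in the second.
print(distance_to_line([4, 4, 7], [1, 0, 0], [0, 0, 2]))
```

Expanding $p^*(I - uu^*)p$ as $p^*p - (u^*p)^2$ avoids forming the matrix $I - uu^*$ explicitly.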

4, 6 The Line of Regression. 

In section 1, 8 the line of regression was defined as that straight line which fitted the data 
best. Now that we have introduced the machinery of vectors and matrices this can be handled 
in the general case and a formula found for the line of regression. 

Suppose the data given is the set of vectors $r_k$, $k = 1, 2, \ldots, m$, each real and of dimension
n. For example census data, each datum being the age, income, height, etc. of an individual.

Let $\bar r = \frac{1}{m}\sum_{k=1}^{m} r_k$. Let $M = \sum_{k=1}^{m} (r_k - \bar r)(r_k - \bar r)^*$, which will be needed later.
m k=l k=l 

If the line of regression goes through the point w with the direction u then the square of the
distance from $r_k$ to the line is $(r_k - w)^*(I - uu^*)(r_k - w)$, and the sum of the squares of these
distances is

(4, 6, 1)    $\sum_{k=1}^{m} (r_k - w)^*(I - uu^*)(r_k - w)$

    $= \sum_{k=1}^{m} (r_k - w)^*(r_k - w) - \sum_{k=1}^{m} (r_k - w)^*uu^*(r_k - w)$

    $= t(M) - \sum_{k=1}^{m} u^*(r_k - w)(r_k - w)^*u$

    $= t(M) - u^*Mu$,

where t(M) is the trace of M and the last two lines take $w = \bar r$, so that the sums reduce to the matrix M defined above.

The restriction $u^*u = 1$ can be imposed on u for convenience. Now introduce the Lagrange
multiplier $\lambda$ and minimize

$S(u,w) = \sum_{k=1}^{m} (r_k - w)^*(r_k - w) - u^*\sum_{k=1}^{m} (r_k - w)(r_k - w)^*u + \lambda u^*u$.
A necessary condition for a minimum is that $\nabla_u S = 0$ and $\nabla_w S = 0$.

$0 = \nabla_u S = -2\sum_{k=1}^{m} (r_k - w)(r_k - w)^*u + 2\lambda u$

$0 = \nabla_w S = -2\sum_{k=1}^{m} r_k + 2mw + 2\sum_{k=1}^{m} u r_k^*u - 2m\,u w^*u$.

These reduce to

(4, 6, 2)    $Mu = \lambda u$

    $(I - uu^*)(\bar r - w) = 0$.






The first of these says that u is an eigenvector of M and $\lambda$ the corresponding eigenvalue.
Multiplying both sides by $u^*$ we get $u^*Mu = \lambda$ so that (4, 6, 1) becomes $S(u,w) = t(M) - \lambda$,
from which we infer that $\lambda$ is the largest eigenvalue of M.

The second equation shows that the distance of $\bar r$ from the line, $(\bar r - w)^*(I - uu^*)(\bar r - w)$,
is 0. Thus the line goes through $\bar r$, the center of gravity.
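The result can be sketched in code: the line of regression passes through the mean of the data and points along the eigenvector of M with the largest eigenvalue. The text does not say how that eigenvector is to be computed; the sketch below uses power iteration, which is an assumption of this example, and `regression_line` is a hypothetical name. It also assumes the starting vector is not orthogonal to the wanted eigenvector.

```python
import math

def regression_line(points, iterations=200):
    """Return (mean, unit direction) of the least-squares line through points."""
    m, n = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / m for i in range(n)]
    # M = sum over k of (r_k - mean)(r_k - mean)*
    M = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points)
          for j in range(n)] for i in range(n)]
    u = [1.0] * n                      # starting guess (an assumption)
    for _ in range(iterations):        # power iteration: u <- Mu / |Mu|
        v = [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:                # all points coincide; direction undefined
            break
        u = [x / norm for x in v]
    return mean, u

# Four points lying exactly on the line y = 2x.
mean, u = regression_line([(0, 0), (1, 2), (2, 4), (3, 6)])
print(mean, u)   # the direction is proportional to (1, 2)
```

Because M is a sum of outer products of real vectors, its eigenvalues are non-negative, which is what lets power iteration settle on the largest one.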



4, 7 Examples of Cryptologic Applications. 
$= \frac{1}{c}\sum_{j=0}^{c-1}\sum_{k=0}^{c-1} f_k e^{j(k-n)w}$

$= \frac{1}{c}\sum_{k=0}^{c-1} f_k \sum_{j=0}^{c-1} e^{j(k-n)w}$

$= f_n$,

since

(6, 1, 4)    $\sum_{j=0}^{c-1} e^{jmw} = \begin{cases} 0 & \text{if } m \neq 0 \\ c & \text{if } m = 0. \end{cases}$



The transformation (6, 1, 2) $F_j = \sum_{k=0}^{c-1} f_k e^{jkw}$ replaces the frequency distribution $\{f_k\}$ by an
equivalent set of c numbers $\{F_j\}$. That it is equivalent is shown by equation (6, 1, 3), which
computes {f} in terms of {F}.



Example:

$c = 4$, $f_0 = 3$, $f_1 = 2$, $f_2 = 1$, $f_3 = 3$.

$w = \frac{2\pi i}{4} = \frac{\pi i}{2}$,

$\omega = e^{w} = e^{\pi i/2} = i$.

Therefore

$\omega^2 = -1$, $\omega^3 = -i$, and $\omega^4 = \omega^0 = 1$.

$F_0 = f_0 + f_1 + f_2 + f_3 = 9$.
$F_1 = f_0 + \omega f_1 + \omega^2 f_2 + \omega^3 f_3 = 2 - i$.
$F_2 = f_0 + \omega^2 f_1 + \omega^0 f_2 + \omega^2 f_3 = -1$.
$F_3 = f_0 + \omega^3 f_1 + \omega^2 f_2 + \omega f_3 = 2 + i$.

(6, 1, 5)    $\varphi(t) = \frac{1}{4}\,[\,9 + (2-i)e^{-t} - e^{-2t} + (2+i)e^{-3t}\,]$

$\varphi(0) = \frac{1}{4}\,[12] = 3 = f_0$

$\varphi(w) = \frac{1}{4}\,[\,9 - (2-i)i + 1 + (2+i)i\,] = \frac{1}{4}\,[8] = 2 = f_1$, etc.

The coefficients $F_j$ are in general complex, and satisfy the relation $F_{c-j} = \bar F_j$,
see (6, 2, 12) below. Thus the number of degrees of freedom in the statistic is still c. Also
notice that $F_0 = \sum_{i=0}^{c-1} f_i$ is the total number of letters in the count.
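The worked example can be checked mechanically. This is a small sketch under the sign convention of (6, 1, 2), $F_j = \sum f_k e^{jkw}$ with $w = 2\pi i/c$; the names `transform` and `inverse` are hypothetical, and the inverse follows the relation $f_n = \frac{1}{c}\sum_j F_j e^{-jnw}$ used in the derivation above.

```python
import cmath

def transform(f):
    """F_j = sum over k of f_k * e^{jkw}, with w = 2*pi*i/c."""
    c = len(f)
    return [sum(f[k] * cmath.exp(2 * cmath.pi * 1j * j * k / c) for k in range(c))
            for j in range(c)]

def inverse(F):
    """f_n = (1/c) * sum over j of F_j * e^{-jnw}."""
    c = len(F)
    return [sum(F[j] * cmath.exp(-2 * cmath.pi * 1j * j * n / c) for j in range(c)) / c
            for n in range(c)]

f = [3, 2, 1, 3]
F = transform(f)
print([complex(round(x.real, 9), round(x.imag, 9)) for x in F])
print([round(x.real, 9) for x in inverse(F)])   # recovers the counts 3, 2, 1, 3
```

Up to rounding, the transform comes out as 9, 2 - i, -1, 2 + i, matching the hand computation, and the inverse returns the original counts.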
This "characteristic" or "Fourier transform" can be used instead of the frequency distribution. Anything which can be done with one can be done with the other, with perhaps less
difficulty. There are situations in which a small subset of {F} has most of the information of
{f} and is more easily handled. For instance, if c is even, $F_{c/2}$ is the "parity double bulge".
There are favorable situations in which the parity (odd or even) alone is sufficient to betray the
pattern of a cyclic component. In these situations $F_{c/2}$ is a sufficient statistic. In other situations similarly another coefficient $F_j$, or a set of two or three of them, may be sufficient. The
reader who has used Fourier analysis for a curve fitting problem such as spectral analysis will
recognize the method. The main difference is that here we are dealing with the mod 26 ring
rather than a continuous range.

6, 2 Properties of the Fourier Transforms. 

If {x} and {y} are two frequency distributions, and if {X} and {Y} are the corresponding Fourier transforms, then $\{f_j = x_j + y_j\}$ is a frequency distribution, and its Fourier
transform {F} is the sum of those of the components,

(6, 2, 1)    $F_k = X_k + Y_k$.

This follows immediately from formula (6, 1, 2). As a consequence the transforms of the frequency distributions

(6, 2, 2)    $\{f^0\} = \{1, 0, 0, \ldots, 0\}$, $\{f^1\} = \{0, 1, 0, \ldots, 0\}$,

and so forth to $\{f^{c-1}\} = \{0, 0, 0, \ldots, 1\}$, can be calculated in advance, and the transform
of any distribution found by adding the proper number of each.

Let $\{F^i\}$ be the corresponding transforms. Then

(6, 2, 3)    $\{F^0\} = \{1, 1, 1, \ldots, 1\}$

    $\{F^1\} = \{1, \omega, \omega^2, \ldots, \omega^{c-1}\}$,

and so forth. In general $F_j^k = \omega^{jk}$, a complex value.



The example of 6, 1 done this way is as follows. The distribution
{1, 0, 0, 0} has the Fourier transform {1, 1, 1, 1}

{0, 1, 0, 0} has {1, i, -1, -i}

{0, 0, 1, 0} has {1, -1, 1, -1}

{0, 0, 0, 1} has {1, -i, -1, i}.

Now the transform of {3, 2, 1, 3} is

3 {1, 1, 1, 1} + 2 {1, i, -1, -i} + {1, -1, 1, -1}
+ 3 {1, -i, -1, i} = {9, 2-i, -1, 2+i}.

The expected value of F can readily be calculated.

(6, 2, 4)    $E(F_j) = \sum_{k=0}^{c-1} E(f_k)\,\omega^{jk}$,

since the expected value of a sum is the sum of the expected values.






Other moments can also be obtained from those of {f}, such as

(6, 2, 5)    $E(F_jF_h) = \sum_{k=0}^{c-1}\sum_{m=0}^{c-1} E(f_kf_m)\,\omega^{jk+hm}$.

If we assume that f is multinomially distributed (a frequent case) then

$E(f_k) = \frac{N}{c}$, where $N = \sum_{k=0}^{c-1} f_k$,

and then

(6, 2, 6)    $E(F_j) = \frac{N}{c}\sum_{k=0}^{c-1} \omega^{jk} = 0$ if $j \neq 0$,

(6, 2, 7)    $F_0 = \sum_{k=0}^{c-1} f_k = N$.

Since

(6, 2, 8)    $E(f_kf_m) = \frac{N^2 - N}{c^2}$, $k \neq m$,

and

(6, 2, 9)    $E(f_k^2) = \frac{N^2 + (c-1)N}{c^2}$,

we can calculate $E(F_jF_h)$.


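(6, 2, 10)    $E(F_jF_h) = \sum_{k=0}^{c-1}\sum_{m=0}^{c-1} E(f_kf_m)\,\omega^{jk+hm}$

    $= \frac{N^2 - N}{c^2}\sum_{m=0}^{c-1}\sum_{k \neq m} \omega^{jk+hm} + \frac{N^2 + (c-1)N}{c^2}\sum_{k=0}^{c-1} \omega^{k(j+h)}$.

This gives us five cases.

(6, 2, 11)
    $E(F_0^2) = N^2$,
    $E(F_0F_h) = 0$,    $h \not\equiv 0 \bmod c$,
    $E(F_jF_h) = 0$,    $j \not\equiv 0$, $h \not\equiv 0$, $j+h \not\equiv 0 \bmod c$,
    $E(F_jF_{-j}) = N$,    $j \not\equiv 0 \bmod c$,
    and $E(F_j^2) = 0$,    $j \not\equiv 0$, $2j \not\equiv 0 \bmod c$.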
The condition $j + h \neq 0$ is a little surprising. Examination reveals that

(6, 2, 12)    $F_{c-j} = \sum_{k=0}^{c-1} f_k\,\omega^{-jk} = \bar F_j$,

the conjugate of $F_j$.



Now in dealing with complex statistics the definition of variance is

(6, 2, 13)    $\sigma^2(z) = E(z\bar z) - E(z)E(\bar z)$.

Thus we have

(6, 2, 14)    $\sigma^2(F_j) = E(F_j\bar F_j) - E(F_j)E(\bar F_j) = E(F_jF_{c-j}) - 0 = N$ for $j \neq 0$,

(6, 2, 15)    $\sigma^2(F_0) = N^2 - N \cdot N = 0$.



The covariance of two complex variables is defined as

(6, 2, 16)    $\mu_{11}(x, y) = E(x\bar y) - E(x)E(\bar y)$.

Thus

(6, 2, 17)    $\mu_{11}(F_j, F_h) = E(F_j\bar F_h) - E(F_j)E(\bar F_h) = E(F_jF_{c-h}) - E(F_j)E(F_{c-h})$

    $= 0$ if $jh = 0$,
    $= N$ if $j = h \neq 0$,
    $= 0$ if $j \neq h$ and $jh \neq 0$.



(6, 2, 18)    $\sum_{j=0}^{c-1} |F_j|^2 = c\sum_{k=0}^{c-1} f_k^2 = N^2\gamma$.

Proof: $\sum_{j=0}^{c-1} |F_j|^2 = \sum_j F_jF_{-j} = \sum_j\sum_k\sum_n f_kf_n\,\omega^{j(k-n)} = c\sum_{k=0}^{c-1} f_k^2$,

as is seen by changing the order of summation and using (6, 1, 4).
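These statements can be checked exactly for a small alphabet by listing every possible text. The sketch below is illustrative only; it assumes that each of the $c^N$ texts of N letters is equally likely (the multinomial case with all letter probabilities equal to 1/c), and the names `transform` and `exact_moments` are hypothetical.

```python
import cmath
from itertools import product

def transform(f):
    """F_j = sum over k of f_k * e^{jkw}, with w = 2*pi*i/c."""
    c = len(f)
    return [sum(f[k] * cmath.exp(2 * cmath.pi * 1j * j * k / c) for k in range(c))
            for j in range(c)]

def exact_moments(c, N):
    """Average of |F_j|^2 over all c**N equally likely texts of N letters."""
    totals = [0.0] * c
    count = 0
    for text in product(range(c), repeat=N):
        f = [text.count(letter) for letter in range(c)]
        F = transform(f)
        # Identity (6, 2, 18): sum |F_j|^2 = c * sum f_k^2 for every count.
        assert abs(sum(abs(x) ** 2 for x in F) - c * sum(x * x for x in f)) < 1e-9
        for j in range(c):
            totals[j] += abs(F[j]) ** 2
        count += 1
    return [t / count for t in totals]

# With c = 3 and N = 4: E|F_0|^2 should be N^2 and E|F_j|^2 should be N.
print(exact_moments(3, 4))
```

Since $E(F_j) = 0$ for $j \neq 0$, the averages of $|F_j|^2$ for $j \neq 0$ are the variances of (6, 2, 14).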



6, 3 Real Part. 

It is possible to work with real values exclusively by considering only the real value of the
transform. Let $R_j$ be the real and $S_j$ be the imaginary part of $F_j$. Then

(6, 3, 1)    $F_j = R_j + iS_j$.

By (6, 1, 1) the real part is

(6, 3, 2)    $R\left(\varphi\left(\frac{2n\pi i}{c}\right)\right) = \frac{1}{c}\sum_{j=0}^{c-1} R\left(F_j e^{-2jn\pi i/c}\right) = \frac{1}{c}\sum_{j=0}^{c-1} R_j\cos\frac{2\pi jn}{c} + \frac{1}{c}\sum_{j=0}^{c-1} S_j\sin\frac{2\pi jn}{c}$,

and the imaginary part is

(6, 3, 3)    $I\left(\varphi\left(\frac{2n\pi i}{c}\right)\right) = \frac{1}{c}\sum_{j=0}^{c-1} S_j\cos\frac{2\pi jn}{c} - \frac{1}{c}\sum_{j=0}^{c-1} R_j\sin\frac{2\pi jn}{c}$.

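Since by (6, 1, 2) F is additive, so are R and S.

Since $\varphi(0) = f_0$ is real,

$\frac{1}{c}\sum_{j=0}^{c-1} R_j = f_0$.

Since $\varphi(nw) = f_n$ is real, its imaginary part vanishes,

$\frac{1}{c}\sum_{j=0}^{c-1}\left(S_j\cos\frac{2\pi jn}{c} - R_j\sin\frac{2\pi jn}{c}\right) = 0$.

Also

(6, 3, 4)    $\sum_{j=0}^{c-1} S_j = \sum_{j=0}^{c-1}\sum_{k=0}^{c-1} f_k\sin\left(jk\,\frac{2\pi}{c}\right) = \sum_{k=0}^{c-1} f_k\sum_{j=0}^{c-1}\sin\left(jk\,\frac{2\pi}{c}\right) = 0$.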

In the example of 6, 1 the real parts of the transforms are 9, 2, —1, and 2. The imaginary 
parts are 0, —1, 0, and 1. 



6, 4 Absolute Value. 

There is another version of the transform which is sometimes useful. This is

(6, 4, 1)    $T_j = |F_j|$.

If one is dealing with a cipher system which applies a "slide", then this absolute value has a useful
property. A slide is a known substitution with a single cycle, or a power thereof. That is, with
a slide of s, the frequency distribution $\{f_k\}$ becomes $\{g_k = f_{k-s}\}$. All subscripts are taken mod c.
If {G} is the transform of {g} then

$G_j = \sum_{k=0}^{c-1} g_k\,\omega^{jk}$,

$G_j = \sum_{k=0}^{c-1} g_{k+s}\,\omega^{j(k+s)}$,

$G_j = \sum_{k=0}^{c-1} f_k\,\omega^{jk}\omega^{js}$,

(6, 4, 2)    $G_j = \omega^{js}F_j$,

(6, 4, 3)    $T_j = |F_j| = |\omega^{-js}| \cdot |G_j| = |G_j|$.

Thus the transform $T_j$ is invariant under a slide. Unfortunately T is not additive like F, R,
and S.

In the example of 6, 1 the absolute values of the transforms are 9, $\sqrt{5}$, 1, and $\sqrt{5}$. If a
slide of s = 1 is applied to the frequency count it becomes 3, 3, 2, 1. The Fourier transforms of
these are 9, 1+2i, 1, and 1-2i, quite different from 9, 2-i, -1, and 2+i. The absolute values
are, however, the same.
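The invariance can be seen numerically on the same count. This is a minimal sketch; `transform` and `slide` are hypothetical names, and the transform uses the convention $F_j = \sum f_k e^{jkw}$, $w = 2\pi i/c$.

```python
import cmath

def transform(f):
    """F_j = sum over k of f_k * e^{jkw}, with w = 2*pi*i/c."""
    c = len(f)
    return [sum(f[k] * cmath.exp(2 * cmath.pi * 1j * j * k / c) for k in range(c))
            for j in range(c)]

def slide(f, s):
    """g_k = f_{k-s}, all subscripts taken mod c."""
    c = len(f)
    return [f[(k - s) % c] for k in range(c)]

f = [3, 2, 1, 3]
g = slide(f, 1)
print(g)                                          # the slid count 3, 3, 2, 1
print([round(abs(x), 6) for x in transform(f)])   # absolute values for f
print([round(abs(x), 6) for x in transform(g)])   # the same absolute values for g
```

The transforms themselves differ by the factor $\omega^{js}$, which has absolute value 1, so only the phases change.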

6, 5 Application to Minuend Systems. 




7. Theory of Circulices.

In certain algebraic and statistical procedures special matrices arise of the type

$\begin{pmatrix} a_0 & a_1 & a_2 & \cdots & a_{c-1} \\ a_{c-1} & a_0 & a_1 & \cdots & a_{c-2} \\ \vdots & & & & \vdots \\ a_1 & a_2 & a_3 & \cdots & a_0 \end{pmatrix}$.

That is, each row is a slide to the right of the row above. Such a matrix is called a "circulix."
7, 1 Enciphering Equations.



7, 2 Properties of Circulices.

The sum of circulices is a circulix, the zero matrix (every element 0) is a circulix, and the 
negative of a circulix is a circulix. Therefore they form an additive group. 

The product of circulices is also a circulix, as is seen by

(7, 2, 1)    $(a_{j-i})(b_{j-i}) = \left(\sum_{k=0}^{c-1} a_{k-i}b_{j-k}\right) = (z_{ij})$. All subscripts are mod c.

$z_{i+s,\,j+s} = \sum_{k=0}^{c-1} a_{k-s-i}\,b_{j+s-k} = \sum_{h=0}^{c-1} a_{h-i}\,b_{j-h} = z_{ij}$,

where $h = k - s$. This establishes that the product is a circulix.

Circulices are permutable, for

(7, 2, 2)    $(b_{j-i})(a_{j-i}) = \left(\sum_{k=0}^{c-1} b_{k-i}\,a_{j-k}\right)$.

If we put $k = i + j - h$ mod c,

$\sum_{k=0}^{c-1} b_{k-i}\,a_{j-k} = \sum_{h=0}^{c-1} b_{j-h}\,a_{h-i}$,

which is identical with (7, 2, 1).
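Both properties are easy to confirm on a small case. The sketch below is illustrative; `circulix`, `matmul` and `is_circulix` are hypothetical helper names, and the matrix is built with entry (i, j) equal to $a_{j-i}$, subscripts mod c.

```python
def circulix(a):
    """Matrix whose entry (i, j) is a[(j - i) mod c]."""
    c = len(a)
    return [[a[(j - i) % c] for j in range(c)] for i in range(c)]

def matmul(x, y):
    """Ordinary product of two c x c matrices given as lists of rows."""
    c = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(c)) for j in range(c)]
            for i in range(c)]

def is_circulix(m):
    """True if every row is the first row slid to the right."""
    return m == circulix(m[0])

A = circulix([1, 2, 0, 3])
B = circulix([2, 0, 1, 1])
AB = matmul(A, B)
print(is_circulix(AB))       # the product is again a circulix
print(AB == matmul(B, A))    # and the factors commute
```

Integer entries are used so that the comparisons are exact.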

If a circulix has an inverse, then that inverse is a circulix. Suppose

(7, 2, 3)    $(x_{ij})(b_{j-i}) = (\delta_{ij}) = I$.

Then $\sum_{k=0}^{c-1} x_{ik}\,b_{j-k} = \delta_{ij}$ for all i and j. Replace i by i+s, j by j+s, and let h = k - s, getting

(7, 2, 4)    $\sum_{h=0}^{c-1} x_{i+s,\,h+s}\,b_{j-h} = \delta_{i+s,\,j+s} = \delta_{ij}$.

Thus if $(x_{ij})$ is a solution of (7, 2, 3), then $(x_{i+s,\,j+s})$ is also a solution. If $(b_{j-i})$ has an inverse,
then it is unique and $(x_{i+s,\,j+s}) = (x_{ij})$, showing that the inverse is a circulix.



If $\omega$ is a primitive cth root of 1, then $v = (\omega^i)$ is a vector which is merely stretched when
multiplied by a circulix A. For

(7, 2, 5)    $Av = (a_{j-i})(\omega^j) = \left(\sum_{k=0}^{c-1} a_{k-i}\,\omega^k\right) = v\sum_{k=0}^{c-1} a_{k-i}\,\omega^{k-i}$.

Thus v is an "eigenvector" of A, and $F = \sum_{k=0}^{c-1} a_{k-i}\,\omega^{k-i}$, which is independent of i, is the corresponding "eigenvalue".

If $v_t = (\omega^{it})$, then it also is an eigenvector of A. For

$Av_t = (a_{j-i})(\omega^{jt}) = \left(\sum_{k=0}^{c-1} a_{k-i}\,\omega^{kt}\right) = v_t\sum_{k=0}^{c-1} a_{k-i}\,\omega^{(k-i)t} = v_tF_t$,

(7, 2, 6)    where $F_t = \sum_{k=0}^{c-1} a_{k-i}\,\omega^{(k-i)t}$ is independent of i.

The c vectors $v_t$ are seen to be linearly independent. We can combine them into the square
matrix

$V = (\omega^{it})$.






Now $AV = VL$ where L is the diagonal matrix with the eigenvalues $F_t$ as diagonal elements.
Since the $v_t$ are linearly independent V has an inverse $V^{-1}$, so

(7, 2, 7)    $V^{-1}AV = L$.

The eigenmatrix V depends only on the fact that A is a circulix, and not on the $a_t$. Thus all
circulices can be transferred to diagonal form L by the same matrix V. The diagonal elements
are the eigenvalues, $F_t$.

If A and B are two circulices, they are each transforms of diagonal matrices, $A = VLV^{-1}$
and $B = VMV^{-1}$.

(7, 2, 8)    $A + B = VLV^{-1} + VMV^{-1} = V(L+M)V^{-1}$.

That is, the sum of two circulices has for its eigenvalues the sum of the respective eigenvalues of
the summands A and B. The order of the eigenvalues is determined uniquely by V.

(7, 2, 9)    $AB = VLV^{-1} \cdot VMV^{-1} = VLMV^{-1}$.

The product of two circulices has for its eigenvalues the products of the respective eigenvalues of
the factors.
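The eigenvector relation $Av_t = F_tv_t$ can be verified directly for a small circulix. The sketch below is illustrative only; `circulix` and `circulix_eigenvalues` are hypothetical names, and $\omega$ is taken as $e^{2\pi i/c}$, one particular primitive cth root of 1.

```python
import cmath

def circulix(a):
    """Matrix whose entry (i, j) is a[(j - i) mod c]."""
    c = len(a)
    return [[a[(j - i) % c] for j in range(c)] for i in range(c)]

def circulix_eigenvalues(a, tol=1e-9):
    """Return [F_0, ..., F_{c-1}] after checking A v_t = F_t v_t for each t."""
    c = len(a)
    omega = cmath.exp(2 * cmath.pi * 1j / c)      # a primitive c-th root of 1
    A = circulix(a)
    values = []
    for t in range(c):
        v = [omega ** (i * t) for i in range(c)]              # v_t = (omega^{it})
        F_t = sum(a[k] * omega ** (k * t) for k in range(c))  # F_t = sum a_k omega^{kt}
        Av = [sum(A[i][j] * v[j] for j in range(c)) for i in range(c)]
        if any(abs(Av[i] - F_t * v[i]) > tol for i in range(c)):
            raise ValueError("v_t is not an eigenvector")
        values.append(F_t)
    return values

# First row 3, 2, 1, 3: the eigenvalues are close to 9, 2 - i, -1, 2 + i.
print(circulix_eigenvalues([3, 2, 1, 3]))
```

The vectors $v_t$ do not depend on the entries of the circulix, which is why one matrix V diagonalises every circulix of the same size.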

7, 3 Fourier Transforms and Circulices. 

Looking back to formula (6, 1, 2)

$F_j = \sum_{k=0}^{c-1} f_k e^{jkw}$, where $w = \frac{2\pi i}{c}$,

we see that if $\omega = e^{w}$ this formula is identical with (7, 2, 6)

$F_j = \sum_{k=0}^{c-1} a_k\,\omega^{kj}$,

where $a_k = f_k$. Thus the eigenvalue $F_j$ is identical with the Fourier transform $F_j$.

We repeat here the results from the Fourier transform theory stated in terms of eigenvalues.
The eigenvalues are in general complex numbers. If the circulix is real, then the values are conjugate in pairs,

$F_{c-j} = \bar F_j$.

The first eigenvalue $F_0 = \sum_{k=0}^{c-1} a_k$. If the circulix comes from a frequency count, $F_0$ is the total
count.

If c is even, $F_{c/2}$ measures the deviation from random mod 2, and has been used by itself
to place cribs and set messages.

If f is a circulix of frequencies, then each diagonal element of $f^*f$ is given by
$\sum_{k=0}^{c-1} f_k^2$. Therefore the trace (the sum on the principal diagonal) of $f^*f$ is $c\sum_{k=0}^{c-1} f_k^2 = N^2\gamma$.
Since the trace is invariant under matrix transformation, therefore the trace of $L^*L$ is the same,

$\sum_{j=0}^{c-1} F_j\bar F_j = N^2\gamma$. See (6, 2, 18).
j=0 

7, 4 Polynomials and Circulices. 

If A is a circulix with elements $a_k$, we can use these elements to define a polynomial.

(7, 4, 1)    $A(x) = \sum_{k=0}^{c-1} a_kx^k$.

The variable x is an indeterminate. Any polynomial A(x) of the form of (7, 4, 1) determines a
circulix.

The product of two circulices

$AB = (a_t)(b_t) = \left(\sum_{k=0}^{c-1} a_kb_{t-k}\right)$

(7, 4, 2)    defines the polynomial $\sum_{t=0}^{c-1}\sum_{k=0}^{c-1} a_kb_{t-k}\,x^t$. The product of the polynomials

(7, 4, 3)    $A(x)B(x) = \sum_{k=0}^{c-1}\sum_{m=0}^{c-1} a_kb_m\,x^{k+m} = \sum_{k=0}^{c-1}\sum_{t=k}^{c-1+k} a_kb_{t-k}\,x^t$.

Comparing this with (7, 4, 2), we see it differs only in the range of t. In (7, 4, 2) the subscripts
are understood to be modulo c. To impose the same convention on (7, 4, 3) would mean to
interpret the exponent on x modulo c also. That is, $x^{c+k}$ to mean $x^k$, and $x^c = x^0 = 1$, or
$x^c - 1 = 0$. If the product A(x) B(x) of the polynomials is taken modulo $x^c - 1$ the correspondence between the circulices and the polynomials is an isomorphism.*
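The correspondence can be illustrated by multiplying two polynomials modulo $x^c - 1$ and comparing the result with the first row of the product of the corresponding circulices. This is a sketch only; `circulix`, `matmul` and `poly_mul_mod` are hypothetical names.

```python
def circulix(a):
    """Matrix whose entry (i, j) is a[(j - i) mod c]."""
    c = len(a)
    return [[a[(j - i) % c] for j in range(c)] for i in range(c)]

def matmul(x, y):
    """Ordinary product of two c x c matrices given as lists of rows."""
    c = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(c)) for j in range(c)]
            for i in range(c)]

def poly_mul_mod(a, b):
    """Coefficients of A(x)B(x) reduced modulo x^c - 1."""
    c = len(a)
    out = [0] * c
    for k in range(c):
        for m in range(c):
            out[(k + m) % c] += a[k] * b[m]   # x^{k+m} is read as x^{(k+m) mod c}
    return out

a = [1, 2, 0, 3]
b = [2, 0, 1, 1]
product_matrix = matmul(circulix(a), circulix(b))
print(product_matrix == circulix(poly_mul_mod(a, b)))   # the two products agree
```

Reducing the exponent mod c is exactly the convention that turns the range of t in (7, 4, 3) into that of (7, 4, 2).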

Reference to (7, 2, 6) shows that the eigenvalues are merely specific values of the polynomial,

(7, 4, 4)    $F_j = A(\omega^j)$.

The c eigenvalues can be used to define a new polynomial,

(7, 4, 5)    $E(x) = \sum_{j=0}^{c-1} F_jx^j$.

Then the eigenvalues of E are

$E(\omega^{-k}) = \sum_{j=0}^{c-1} F_j\,\omega^{-jk} = ca_k$.

See (6, 1, 1) and (6, 1, 2). Thus the two polynomials are symmetrically related; each can be
derived from the other.

*This was pointed out to me by LTJG William Blankinship.